Content deleted Content added
No edit summary |
major rewrite/expansion, destub - markov models and genetic algorithms to come |
||
Line 1:
[[Image:RPLP0_90_ClustalW_aln.gif|right|thumb|300px|First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with [[ClustalW]].]]
A '''multiple sequence alignment (MSA)''' is a [[sequence alignment]] of three or more [[biological sequences]], generally [[protein]], [[DNA]], or [[RNA]]. In general, the input set of query sequences are assumed to have an [[evolution]]ary relationship by which they share a lineage and are descended from a common ancestor. From the resulting MSA, sequence [[homology (biology)|homology]] can be inferred and [[molecular phylogeny|phylogenetic analysis]] can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate [[mutation]] events such as point mutations (single [[amino acid]] or [[nucleotide]] changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (or [[indel]]s) that appear as gaps in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence [[conservation (genetics)|conservation]] of [[protein ___domain]]s, [[tertiary structure|tertiary]] and [[secondary structure|secondary]] structures, and even individual amino acids or nucleotides.
"Multiple sequence alignment" also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length are nearly impossible to align by hand, computational [[algorithm]]s are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than [[sequence alignment|pairwise alignment]] because they are more computationally complex to produce. Most multiple sequence alignment programs use [[heuristic]] methods rather than [[global optimization]] because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.
==Dynamic programming and computational complexity==
The most direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a substitution matrix can be used, but since there are only four possible standard characters per sequence and the individual nucleotides do not typically differ much in substitution probability, the parameters for DNA and RNA sequences usually consist of a gap penalty, a positive score for character matches, and a negative score for mismatches.
For ''n'' individual sequences, the method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise dynamic programming. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP-hard]] problem{{ref|Wang}}{{ref|Just}}. Methods to reduce the search space by first performing pairwise dynamic programming on each pair of sequences in the query set and searching only the solution space near these results (effectively finding the intersection between local paths immediately surrounding each pairwise optimum solution) render the dynamic programming technique more efficient. The so-called "sum of pairs" method has been implemented in the software package [http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA], but it is still impractical for many MSA applications that require the simultaneous alignment of dozens or even a few hundred sequences. Dyanmic programming methods are now used mainly as [[benchmark]]ing standards in evaluating new or refined heuristic techniques.
==Progressive alignment construction==
One method of performing a heuristic alignment search is the progressive technique (also known as the hierarchical or tree method) that builds up a final MSA by first performing a series of pairwise alignments on successively less closely related sequences. Such methods begin by aligning the two most closely related sequences first and then successively aligning the next most closely related sequence in the query set to the alignment produced in the previous step. The initial "most related" pair is determined by an efficient [[clustering]] method such as [[neighbor-joining]] based on a simple heuristic search of the query set with a tool like [[FASTA]]. Progressive techniques therefore automatically construct a [[phylogenetic tree]] as well as an alignment.
One major limitation of progressive methods is their heavy dependence on the initial assignment of relatedness and on the quality of the initial alignment. The methods are thus sensitive as well to the distribution of sequences in the query set; performance improves when relatedness among query sequences is a relatively smooth gradient rather than distantly separated clusters. Performance also degrades significantly when all of the sequences in the set are rather distantly related, because inaccuracies in the initial alignment are then more likely. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. Judicious choice of weighting can aid in evaluating relatedness and mitigate the effects of relatively poor initial alignments early in the progression.
Progressive alignment methods are efficient enough to implement on a large scale for many sequences and are often run on publicly accessible web servers so users need not locally install the applications of interest. A very popular progressive alignment method is the [[Clustal]] family{{ref|Higgins}}, especially the weighted variant ClustalW{{ref|Thomson}} to which access is provided by a large number of web portals including [http://align.genome.jp/ GenomeNet], [http://www.ebi.ac.uk/clustalw/ EBI], and [http://www.ch.embnet.org/software/ClustalW.html EMBNet]. Different portals or implementations can vary in user interface and make different parameters accessible to the user. Clustal is used extensively for phylogenetic tree construction and as input for [[protein structure prediction]] by homology modeling.
Another common progressive alignment method called [[T-Coffee]]{{ref|Notredame}} is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-coffee uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple reqions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A very recent semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in [[polynomial time]]{{ref|Sze}} has been implemented in the program [http://faculty.cs.tamu.edu/shsze/psalign PSAlign].
==Iterative methods==
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result - that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general [[objective function]] such as finding a high-quality alignment score.
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique{{ref|Hirosawa}}. The software package [http://prrn.hgc.jp/ PRRN/PRRP] uses a [[hill-climbing algorithm]] to optimize its MSA alignment score{{ref|Gotoh}} and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA{{ref|Mount}}. PRRP performs best when refining an alignment previously constructed by a faster method.
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty{{ref|Brudno}}. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. DIALIGN is also available as a web portal at [http://dialign.gobics.de/chaos-dialign-submission CHAOS/DIALIGN].
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences{{ref|Edgar}}. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled). A web portal and download site is available at [http://www.drive5.com/muscle/ MUSCLE].
== References ==
#{{note|Wang}} Wang L, Jiang T. (1994) On the complexity of multiple sequence alignment. ''J Comput Biol'' 1:337-348.
#{{note|Just}} Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. ''J Comput Biol'' 8(6):615-23.
#{{note|Higgins}} Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. ''Gene'' 73(1):237-44.
#{{note|Thomson}} Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. ''Nucleic Acids Res'' 22:4673-4680.
#{{note|Notredame}} Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. ''J Mol Biol'' 302(1):205-17.
#{{note|Sze}} Sze SH, Lu Y, Yang Q. (2006). A polynomial time solvable formulation of multiple sequence alignment. ''J Comput Biol'' 13(2):309-19.
#{{note|Hirosawa}} Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. ''Comput Appl Biosci'' 11:13-18.
#{{note|Gotoh}} Gotoh O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. ''J Mol Biol'' 264(4):823-38.
#{{note|Mount}} Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
#{{note|Brudno}} Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive multiple alignment of large genomic sequences. ''BMC Bioinformatics'' 4:66.
#{{note|Edgar}} Edgar RC. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput. ''Nucleic Acids Research'' 32(5), 1792-97.
=== Survey articles ===
Line 51 ⟶ 79:
| pages = 12682--2690
}}
[[Category:Bioinformatics]]
|