Multiple sequence alignment: Difference between revisions

Content deleted Content added
No edit summary
major rewrite/expansion, destub - markov models and genetic algorithms to come
Line 1:
[[Image:RPLP0_90_ClustalW_aln.gif|right|thumb|300px|First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with [[ClustalW]].]]
 
A '''multiple sequence alignment (MSA)''' is a [[sequence alignment]] of three or more [[biological sequences]], generally [[protein]], [[DNA]], or [[RNA]]. In general, the input set of query sequences are assumed to have an [[evolution]]ary relationship by which they share a lineage and are descended from a common ancestor. From the resulting MSA, sequence [[homology (biology)|homology]] can be inferred and [[molecular phylogeny|phylogenetic analysis]] can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate [[mutation]] events such as point mutations (single [[amino acid]] or [[nucleotide]] changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (or [[indel]]s) that appear as gaps in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence [[conservation (genetics)|conservation]] of [[protein ___domain]]s, [[tertiary structure|tertiary]] and [[secondary structure|secondary]] structures, and even individual amino acids or nucleotides.
A '''multiple sequence alignment''' is a [[sequence alignment]] of three or more [[biological sequence]]s.
 
"Multiple sequence alignment" also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length are nearly impossible to align by hand, computational [[algorithm]]s are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than [[sequence alignment|pairwise alignment]] because they are more computationally complex to produce. Most multiple sequence alignment programs use [[heuristic]] methods rather than [[global optimization]] because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.
The multiple alignment problem consists of inferring all [[Homology (biology)|homologous]] characters among these sequences. The characters may consist of single [[Nucleotide|
nucleotides]], [[Amino acid|amino acids]], genes, or any sequence segments that may share an evolutionary origin.
Multiple alignments may be used to study which sequences have been conserved over time. This is the starting point of [[Comparative genomics|comparative genomics]] and [[Molecular phylogeny| molecular phylogenetics]] studies. The theoretical basis for multiple sequence alignments is that the sequences have evolved by [[point mutation]]s, [[deletion mutation]]s or [[gene insertion|insertion mutation]]s. Point mutations would result in the alignment having differing characters in the same column of the alignment, while deletions and insertions would have gaps in the columns effected by the insertion or deletion.
 
==Dynamic programming and computational complexity==
==Multiple sequence alignment programs and techniques==
The most direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a substitution matrix can be used, but since there are only four possible standard characters per sequence and the individual nucleotides do not typically differ much in substitution probability, the parameters for DNA and RNA sequences usually consist of a gap penalty, a positive score for character matches, and a negative score for mismatches.
 
For ''n'' individual sequences, the method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise dynamic programming. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP-hard]] problem{{ref|Wang}}{{ref|Just}}. Methods to reduce the search space by first performing pairwise dynamic programming on each pair of sequences in the query set and searching only the solution space near these results (effectively finding the intersection between local paths immediately surrounding each pairwise optimum solution) render the dynamic programming technique more efficient. The so-called "sum of pairs" method has been implemented in the software package [http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA], but it is still impractical for many MSA applications that require the simultaneous alignment of dozens or even a few hundred sequences. Dyanmic programming methods are now used mainly as [[benchmark]]ing standards in evaluating new or refined heuristic techniques.
A common approach for multiple sequence alignment is to progressively align sequences using a guide tree. Initially, each sequence at the leaves is represented as a trivial alignment of a single sequence. Then, at each internal node, the alignments at its children are merged into an alignment of the alignments. At the end, the root contains an alignment on all the sequences. This is called ''progressive alignment''. Usually, these alignemnts which occur at the interior nodes are done as [[pairwise alignment]]s where each of the two children alignments is treated as a single sequence. Two sequences can be aligned using dynamic programming techniques such as [[Needleman-Wunsch]] and a [[substitution matrix]] such as [[BLOSUM]] or [[Margaret Dayhoff]]'s [[Point accepted mutation|PAM]] (Point Accepted Mutation).
 
==Progressive alignment construction==
There are many computer programs to produce multiple sequence alignments starting with a collection of unaligned sequences ([[Clustal|ClustalW]], [[DIALIGN]], [[MAVID]], [[Threaded Blockset Aligner|TBA]], [[T-Coffee]]) and to represent graphically those alignments ([[Clustal|ClustalW]], [http://www.sanger.ac.uk/Software/Pfam/help/belvu_setup.shtml Belvu]).
One method of performing a heuristic alignment search is the progressive technique (also known as the hierarchical or tree method) that builds up a final MSA by first performing a series of pairwise alignments on successively less closely related sequences. Such methods begin by aligning the two most closely related sequences first and then successively aligning the next most closely related sequence in the query set to the alignment produced in the previous step. The initial "most related" pair is determined by an efficient [[clustering]] method such as [[neighbor-joining]] based on a simple heuristic search of the query set with a tool like [[FASTA]]. Progressive techniques therefore automatically construct a [[phylogenetic tree]] as well as an alignment.
 
One major limitation of progressive methods is their heavy dependence on the initial assignment of relatedness and on the quality of the initial alignment. The methods are thus sensitive as well to the distribution of sequences in the query set; performance improves when relatedness among query sequences is a relatively smooth gradient rather than distantly separated clusters. Performance also degrades significantly when all of the sequences in the set are rather distantly related, because inaccuracies in the initial alignment are then more likely. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. Judicious choice of weighting can aid in evaluating relatedness and mitigate the effects of relatively poor initial alignments early in the progression.
 
Progressive alignment methods are efficient enough to implement on a large scale for many sequences and are often run on publicly accessible web servers so users need not locally install the applications of interest. A very popular progressive alignment method is the [[Clustal]] family{{ref|Higgins}}, especially the weighted variant ClustalW{{ref|Thomson}} to which access is provided by a large number of web portals including [http://align.genome.jp/ GenomeNet], [http://www.ebi.ac.uk/clustalw/ EBI], and [http://www.ch.embnet.org/software/ClustalW.html EMBNet]. Different portals or implementations can vary in user interface and make different parameters accessible to the user. Clustal is used extensively for phylogenetic tree construction and as input for [[protein structure prediction]] by homology modeling.
 
Another common progressive alignment method called [[T-Coffee]]{{ref|Notredame}} is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-coffee uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple reqions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.
 
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A very recent semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in [[polynomial time]]{{ref|Sze}} has been implemented in the program [http://faculty.cs.tamu.edu/shsze/psalign PSAlign].
 
==Iterative methods==
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result - that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general [[objective function]] such as finding a high-quality alignment score.
 
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique{{ref|Hirosawa}}. The software package [http://prrn.hgc.jp/ PRRN/PRRP] uses a [[hill-climbing algorithm]] to optimize its MSA alignment score{{ref|Gotoh}} and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA{{ref|Mount}}. PRRP performs best when refining an alignment previously constructed by a faster method.
 
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty{{ref|Brudno}}. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. DIALIGN is also available as a web portal at [http://dialign.gobics.de/chaos-dialign-submission CHAOS/DIALIGN].
 
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences{{ref|Edgar}}. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled). A web portal and download site is available at [http://www.drive5.com/muscle/ MUSCLE].
 
== References ==
#{{note|Wang}} Wang L, Jiang T. (1994) On the complexity of multiple sequence alignment. ''J Comput Biol'' 1:337-348.
#{{note|Just}} Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. ''J Comput Biol'' 8(6):615-23.
#{{note|Higgins}} Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. ''Gene'' 73(1):237-44.
#{{note|Thomson}} Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. ''Nucleic Acids Res'' 22:4673-4680.
#{{note|Notredame}} Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. ''J Mol Biol'' 302(1):205-17.
#{{note|Sze}} Sze SH, Lu Y, Yang Q. (2006). A polynomial time solvable formulation of multiple sequence alignment. ''J Comput Biol'' 13(2):309-19.
#{{note|Hirosawa}} Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. ''Comput Appl Biosci'' 11:13-18.
#{{note|Gotoh}} Gotoh O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. ''J Mol Biol'' 264(4):823-38.
#{{note|Mount}} Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
#{{note|Brudno}} Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive multiple alignment of large genomic sequences. ''BMC Bioinformatics'' 4:66.
#{{note|Edgar}} Edgar RC. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput. ''Nucleic Acids Research'' 32(5), 1792-97.
 
=== Survey articles ===
Line 51 ⟶ 79:
| pages = 12682--2690
}}
 
{{bioinformatics-stub}}
 
[[Category:Bioinformatics]]