Multiple sequence alignment

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In general, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels), which appear as gaps in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
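As a rough illustration of how substitutions and indels show up column by column, the following Python sketch summarizes each column of a small alignment. The function name `column_summary`, the use of `-` as the gap character, and the definition of conservation as the fraction of rows carrying the most common residue are assumptions made for this example, not a standard measure.

```python
from collections import Counter

def column_summary(alignment):
    """Describe each column of an MSA given as equal-length gapped strings."""
    summary = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        # Number of rows that carry the most common residue (0 if all gaps).
        top = counts.most_common(1)[0][1] if counts else 0
        summary.append({
            "conservation": top / len(column),
            "has_gap": len(residues) < len(column),   # indel in this column
            "has_substitution": len(counts) > 1,      # differing characters
        })
    return summary

example = ["ACGT", "AC-T", "AGGT"]
for position, info in enumerate(column_summary(example)):
    print(position, info)
```

In this toy alignment the second column contains a substitution and the third contains a gap, while the first and last columns are fully conserved.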

"Multiple sequence alignment" also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length are nearly impossible to align by hand, computational algorithms are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex to produce. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.

Dynamic programming and computational complexity

The most direct method for producing an MSA uses the dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a substitution matrix can be used, but since there are only four possible standard characters per sequence and the individual nucleotides do not typically differ much in substitution probability, the parameters for DNA and RNA sequences usually consist of a gap penalty, a positive score for character matches, and a negative score for mismatches.
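The nucleotide scoring scheme described above (a match score, a mismatch score, and a gap penalty) can be sketched for the pairwise case, which is the two-dimensional matrix that the multiple-sequence method generalizes. The function name and the particular values +1, -1, and -2 are illustrative assumptions, and a linear gap penalty is used for simplicity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global pairwise alignment score with a linear gap penalty."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair   # align a[i-1] with b[j-1]
            up = prev[j] + gap          # a[i-1] aligned to a gap
            left = curr[j - 1] + gap    # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the matrix are kept because this sketch returns the score alone; recovering the alignment itself would require storing the full matrix or traceback pointers.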

For n individual sequences, the method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise dynamic programming. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. Finding the global optimum for n sequences this way has been shown to be an NP-hard problem^[1]^[2]. The dynamic programming technique can be made more efficient by first performing pairwise dynamic programming on each pair of sequences in the query set and then searching only the solution space near these results (effectively finding the intersection between local paths immediately surrounding each pairwise optimum solution). This reduced search, scored with the so-called "sum of pairs" objective, has been implemented in the software package MSA, but it is still impractical for many MSA applications that require the simultaneous alignment of dozens or even a few hundred sequences. Dynamic programming methods are now used mainly as benchmarking standards in evaluating new or refined heuristic techniques.
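The following Python sketch illustrates two ideas from this paragraph: a "sum of pairs" score for a finished alignment, and the size of the n-dimensional matrix that exact dynamic programming would have to fill. The function names, the scoring values, and the convention that a gap paired with a gap scores zero are assumptions chosen for the example; it is not the algorithm used by the MSA package.

```python
from itertools import combinations
from math import prod

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an MSA given as equal-length gapped strings."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # assumed convention: gap-gap scores 0
            elif x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

def lattice_cells(lengths):
    """Number of cells in the n-dimensional dynamic programming matrix."""
    return prod(length + 1 for length in lengths)

print(sum_of_pairs(["AC", "AC", "AG"]))
for n in (2, 3, 5):
    print(n, lattice_cells([100] * n))
```

For sequences of length 100, each additional sequence multiplies the number of cells by 101, which is the exponential growth in n described above.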

Progressive alignment construction

One method of performing a heuristic alignment search is the progressive technique (also known as the hierarchical or tree method) that builds up a final MSA by first performing a series of pairwise alignments on successively less closely related sequences. Such methods begin by aligning the two most closely related sequences first and then successively aligning the next most closely related sequence in the query set to the alignment produced in the previous step. The initial "most related" pair is determined by an efficient clustering method such as neighbor-joining based on a simple heuristic search of the query set with a tool like FASTA. Progressive techniques therefore automatically construct a phylogenetic tree as well as an alignment.
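The order in which a progressive method adds sequences can be sketched as follows. This is a simplified illustration under stated assumptions: a k-mer Jaccard distance stands in for the heuristic search that a tool like FASTA would provide, and a greedy "closest to anything already included" rule stands in for neighbor-joining and a full guide tree. The names `kmer_distance` and `progressive_order` are hypothetical.

```python
from itertools import combinations

def kmer_distance(a, b, k=3):
    """1 - Jaccard similarity of the k-mer sets of two sequences."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def progressive_order(seqs, k=3):
    """Return sequence indices in the order a progressive method might add them."""
    if len(seqs) < 2:
        return list(range(len(seqs)))
    pairs = list(combinations(range(len(seqs)), 2))
    dist = {frozenset(p): kmer_distance(seqs[p[0]], seqs[p[1]], k) for p in pairs}
    # Start with the most closely related pair.
    order = list(min(pairs, key=lambda p: dist[frozenset(p)]))
    remaining = set(range(len(seqs))) - set(order)
    while remaining:
        # Next, add the sequence closest to any sequence already included.
        nxt = min(sorted(remaining),
                  key=lambda r: min(dist[frozenset((r, o))] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTTCGA"]
print(progressive_order(seqs))
```

A real progressive aligner would perform an alignment at each step and would typically merge whole sub-alignments along a tree rather than adding one sequence at a time; the sketch shows only how relatedness determines the order.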

One major limitation of progressive methods is their heavy dependence on the initial assignment of relatedness and on the quality of the initial alignment. The methods are thus sensitive as well to the distribution of sequences in the query set; performance improves when relatedness among query sequences is a relatively smooth gradient rather than distantly separated clusters. Performance also degrades significantly when all of the sequences in the set are rather distantly related, because inaccuracies in the initial alignment are then more likely. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. Judicious choice of weighting can aid in evaluating relatedness and mitigate the effects of relatively poor initial alignments early in the progression.

Progressive alignment methods are efficient enough to implement on a large scale for many sequences and are often run on publicly accessible web servers so users need not locally install the applications of interest. A very popular progressive alignment method is the Clustal family^[3], especially the weighted variant ClustalW^[4] to which access is provided by a large number of web portals including GenomeNet, EBI, and EMBNet. Different portals or implementations can vary in user interface and make different parameters accessible to the user. Clustal is used extensively for phylogenetic tree construction and as input for protein structure prediction by homology modeling.

Another common progressive alignment method called T-Coffee^[5] is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee uses the output from Clustal as well as from the local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.

Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and the true biological significance of the alignments can be obscure. A semi-progressive method published in 2006 that improves alignment quality and does not use a lossy heuristic while still running in polynomial time^[6] has been implemented in the program PSAlign.

Iterative methods

A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is that these alignments are always incorporated into the final result; that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequences as a means of optimizing a general objective function such as finding a high-quality alignment score.
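The idea of revisiting an existing alignment to improve an objective function can be sketched as a small hill-climbing loop. This is a toy illustration under stated assumptions: the objective is a simple sum-of-pairs score, and the only move considered is relocating a single gap within one row, which stands in for the realignment steps that real iterative programs perform. The names `sp_score`, `gap_moves`, and `refine` are hypothetical.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs objective; gap-gap pairs are assumed to score 0."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total

def gap_moves(row):
    """Yield every row obtained by moving one gap character elsewhere."""
    for i, ch in enumerate(row):
        if ch != "-":
            continue
        rest = row[:i] + row[i + 1:]
        for j in range(len(rest) + 1):
            moved = rest[:j] + "-" + rest[j:]
            if moved != row:
                yield moved

def refine(alignment, score=sp_score, max_rounds=20):
    """Accept gap moves that raise the score until a round finds none."""
    best = list(alignment)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for r in range(len(best)):
            for candidate_row in gap_moves(best[r]):
                candidate = best[:r] + [candidate_row] + best[r + 1:]
                s = score(candidate)
                if s > best_score:
                    best, best_score, improved = candidate, s, True
        if not improved:
            break
    return best, best_score

start = ["ACGT", "AGT-", "ACGT"]
print(sp_score(start), refine(start))
```

Because only score-improving moves are accepted, the loop can stop at a local optimum; it is meant to show the revisit-and-rescore pattern rather than a competitive refinement strategy.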

A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique^[7]. The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its alignment score^[8] and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA^[9]. PRRP performs best when refining an alignment previously constructed by a faster method.

Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or sequence motifs without introducing a gap penalty^[10]. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. DIALIGN is also available as a web portal at CHAOS/DIALIGN.

A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences^[11]. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled). A web portal and download site are available at MUSCLE.

References

^ Wang L, Jiang T. (1994). On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.
^ Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. J Comput Biol 8(6):615-23.
^ Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.
^ Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
^ Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.
^ Sze SH, Lu Y, Yang Q. (2006). A polynomial time solvable formulation of multiple sequence alignment. J Comput Biol 13(2):309-19.
^ Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13-18.
^ Gotoh O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264(4):823-38.
^ Mount DW. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
^ Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B. (2003). Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 4:66.
^ Edgar RC. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-97.

Survey articles

Duret, L. (2000). "Multiple alignment for structural functional or phylogenetic analyses of homologous sequences". In D. Higgins and W. Taylor (eds.). Bioinformatics sequence structure and databanks. Oxford: Oxford University Press.

Notredame, C. (2002). "Recent progresses in multiple sequence alignment: a survey". Pharmacogenomics. 3 (1): 131-144.

Thompson, J. D. (1999). "A comprehensive comparison of multiple sequence alignment programs". Nucleic Acids Research. 27 (13): 2682-2690.