Multiple sequence alignment: Difference between revisions

Content deleted Content added
GreenC bot (talk | contribs)
Removed oxfordjournals.com URL per discussion. Wayback Medic 2.5
Link suggestions feature: 3 links added.
 
(27 intermediate revisions by 15 users not shown)
Line 1:
{{shortShort description|Alignment of more than two molecular sequencesequences}}
[[File:RPLP0 90 ClustalW aln.gif|right|thumb|575pxupright=1.36|First 90 positions of a [[protein]] multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with [[ClustalXClustal]]X.]]
'''Multiple sequence alignment''' ('''MSA''') may refer tois the process or the result of [[sequence alignment]] of three or more [[biological sequence]]s, generally [[protein]], [[DNA]], or [[RNA]]. InThese many cases, the input set of query sequencesalignments are assumedused to have aninfer [[evolutionary]] relationshiprelationships byvia which[[phylogenetic]] they share a linkageanalysis and arecan descended from a common ancestor. From the resulting MSA, sequencehighlight [[homologyHomology (biology)|homologyhomologous]] canfeatures be inferred and [[molecular phylogeny|phylogenetic analysis]] can be conducted to assess thebetween sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at rightAlignments illustratehighlight [[mutation]] events such as [[point mutations]] (single [[amino acid]] or [[nucleotide]] changes) that appear as differing characters in a single alignment column, and [[insertion ormutation]]s deletionand mutations ([[indeldeletion mutation]]s, orand gaps)alignments that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is oftenare used to assess sequence [[conservationConservation (genetics)|conservation]] and infer the presence and activity of [[protein ___domain]]s, [[tertiary structure|tertiary]] ands, [[secondary structure|secondary]] structuress, and even individual amino acids or nucleotides.
 
ComputationalMultiple [[algorithm]]ssequence are used to produce and analyse the MSAs due to the difficulty and intractability of manually processing the sequences given their biologically-relevant length. MSAsalignments require more sophisticated methodologies than [[sequenceSequence alignment#Pairwise alignment|pairwise alignmentalignments]], becauseas they are more [[Computational complexity theory|computationally complex]]. Most multiple sequence alignment programs use [[heuristic]] methods rather than [[global optimization]] because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. On the other handHowever, heuristic methods generally failcannot toguarantee givehigh-quality guaranteessolutions onand thehave solution quality, with heuristic solutionsbeen shown to befail oftento far below theyield near-optimal solutionsolutions on benchmark instancestest cases.<ref name="thompson2011">{{cite journal | doi = 10.1371/journal.pone.0018093|vauthors= Thompson JD, Linard B, Lecompte O, Poch O | year = 2011 | title = A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives | journal = [[PLOS ONEOne]] | volume = 6 | issue =3 3| pages =e18093 e18093| pmid =21483869 21483869| pmc = 3069049|bibcode= 2011PLoSO...618093T |doi-access=free}}</ref><ref name="nuin2006" /><ref name="hosseininasab">{{cite journal | doi = 10.1287/ijoc.2019.0937 |vauthors=Hosseininasab A, van Hoeve WJ | year = 2019 | title = Exact Multiple Sequence Alignment by Synchronized Decision Diagrams | journal = INFORMS Journal on Computing |s2cid=109937203}}</ref>
 
==Problem statement==
 
Given <math>m</math> sequences <math>S_i</math>, <math>i = 1,\cdots,m</math> similar to the form below:
 
Line 42 ⟶ 41:
 
===Tracing alignments===
 
When determining the best suited alignments for each MSA, a ''trace'' is usually generated. A trace is a set of ''realized'', or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences.
 
==Alignment methods==
 
There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
 
=== Dynamic programming ===
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web|title=Help with matrices used in sequence comparison tools|url=http://www.ebi.ac.uk/help/matrix.html|url-status=dead|archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html|archive-date=March 11, 2010|access-date=March 3, 2010|publisher=European Bioinformatics Institute}}</ref>
 
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP complete-completeness|NP-complete]] problem.<ref name="wang">{{cite journal|vauthors=Wang L, Jiang T|year=1994|title=On the complexity of multiple sequence alignment|journal=J Comput Biol|volume=1|issue=4|pages=337–348|citeseerx=10.1.1.408.894|doi=10.1089/cmb.1994.1.337|pmid=8790475}}</ref><ref name="just">{{cite journal|author=Just W|year=2001|title=Computational complexity of multiple sequence alignment with SP-score|journal=J Comput Biol|volume=8|issue=6|pages=615–23|citeseerx=10.1.1.31.6382|doi=10.1089/106652701753307511|pmid=11747615}}</ref><ref name="elias">{{cite journal|author=Elias, Isaac|year=2006|title=Settling the intractability of multiple alignment|journal=J Comput Biol|volume=13|issue=7|pages=1323–1339|citeseerx=10.1.1.6.256|doi=10.1089/cmb.2006.13.1323|pmid=17037961}}</ref> In 1989, based on Carrillo-Lipman Algorithm,<ref name="carrillo">{{cite journal|vauthors=Carrillo H, Lipman DJ|year=1988|title=The Multiple Sequence Alignment Problem in Biology|url=https://zenodo.org/record/1236134|journal=SIAM Journal on Applied Mathematics|volume=48|issue=5|pages=1073–1082|doi=10.1137/0148063}}</ref> Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.<ref name="altschul">{{cite journal|vauthors=Lipman DJ, Altschul SF, Kececioglu JD|year=1989|title=A tool for multiple sequence alignment|journal=Proc Natl Acad Sci U S A|volume=86|issue=12|pages=4412–4415|bibcode=1989PNAS...86.4412L|doi=10.1073/pnas.86.12.4412|pmc=287279|pmid=2734293|doi-access=free}}</ref> In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called ''sum of pair'' score) and has been implemented in a software program for constructing multiple sequence alignments.<ref>{{cite web|title=Genetic analysis software|url=https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html|archive-url=https://web.archive.org/web/20000119082433/http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html|url-status=dead|archive-date=January 19, 2000|access-date=March 3, 2010|publisher=National Center for Biotechnology Information}}</ref> In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity.<ref name="hosseininasab" />
 
=== Progressive alignment construction ===
The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method) developed by Da-Fei Feng and Doolittle in 1987.<ref name="feng1987progressive">{{cite journal |vauthors=Feng DF, Doolittle RF |year=1987 |title=Progressive sequence alignment as a prerequisitettoprerequisitet to correct phylogenetic trees |journal=J Mol Evol |volume=25 |issue=4 |pages=351–360 |pmid=3118049|doi=10.1007/BF02603120 |bibcode=1987JMolE..25..351F |s2cid=6345432 }}</ref> Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a [[phylogenetic tree|tree]], called a ''guide tree'', and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial ''guide tree'' is determined by an efficient [[Cluster analysis|clustering]] method such as [[neighbor-joining]] or ''unweighted pair group method with arithmetic mean'' ([[UPGMA]]), and may use distances based on the number of identical two-letter sub-sequences (as in [[FASTA]] rather than a dynamic programming alignment).<ref name="mount"/>
 
Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program.<ref name="mount"/>
 
Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. Progressive alignment services are commonly available on publicly accessible web servers so users need not locally install the applications of interest. The mostA popular progressive alignment method has been the [[Clustal]] family,.<ref name="higgins1988">{{cite journal |author=[[Desmond G. Higgins|Higgins DG]], Sharp PM |year=1988 |title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer |journal=Gene |volume=73 |issue=1 |pages=237–244 |doi=10.1016/0378-1119(88)90330-7 |pmid=3243435}}</ref> especially the weighted variant ClustalW<ref name="thomson1994">{{cite journal | vauthors = Thompson JD, Higgins DG, Gibson TJ | title = CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | journal = Nucleic Acids Res. | volume = 22 | issue = 22 | pages = 4673–80 | date = November 1994 | pmid = 7984417 | pmc = 308517 | doi = 10.1093/nar/22.22.4673 }}</ref> to which access is provided by a large number of web portals including [https://web.archive.org/web/20060705082556/http://align.genome.jp/ GenomeNet], [http://www.ebi.ac.uk/Tools/clustalw2/index.html EBIClustal], and [http://www.ch.embnet.org/software/ClustalW.html EMBNet]. Different portals or implementations can vary in user interface and make different parameters accessible to the user. ClustalWW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for [[protein structure prediction]] by homology modeling. Current[[European versionBioinformatics of Clustal family is ClustalW2.Institute]] (EMBL-EBI) announced that CLustalW2 will be expiredexpire in August 2015. They recommend [[Clustal]] Omega which performs based on seeded guide trees and HMM profile-profile techniques for protein alignments. They offer differentAn MSAalternative toolstool for progressive DNA alignments. Oneis of''multiple themalignment isusing fast Fourier transform'' ([[MAFFT]] (Multiple Alignment using Fast Fourier Transform).<ref name=EMBL-EBI>{{cite web|title=EMBL-EBI-ClustalW2-Multiple Sequence Alignment|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|website=CLUSTALW2}}</ref>
 
Another common progressive alignment method callednamed [[T-Coffee]]<ref name="notredame2000">{{cite journal | vauthors = Notredame C, Higgins DG, Heringa J | title = T-Coffee: A novel method for fast and accurate multiple sequence alignment | journal = J. Mol. Biol. | volume = 302 | issue = 1 | pages = 205–17 | date = September 2000 | pmid = 10964570 | doi = 10.1006/jmbi.2000.4042 |s2cid=10189971}}</ref> is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.
 
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in [[polynomial time]] has been implemented in the program [http://faculty.cs.tamu.edu/shsze/psalign/ PSAlign].<ref name="sze2006">{{cite journal |vauthors=Sze SH, Lu Y, Yang Q |year=2006 |title=A polynomial time solvable formulation of multiple sequence alignment |journal=J Comput Biol |volume=13 |issue=2 |pages=309–319 |doi=10.1089/cmb.2006.13.309 |pmid=16597242}}</ref>
 
===Iterative methods===
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result&nbsp;— that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general [[objective function]] such as finding a high-quality alignment score.<ref name="mount"/>
 
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique.<ref name="hirosawa">{{cite journal |vauthors=Hirosawa M, Totoki Y, Hoshida M, Ishikawa M | year = 1995 | title = Comprehensive study on iterative algorithms of multiple sequence alignment | journal =Computer ComputApplications Applin Bioscithe |Biosciences |volume = 11 | issue =1 1| pages = 13–18 | pmid = 7796270 | doi=10.1093/bioinformatics/11.1.13}}</ref> The software package [http://www.genome.jp/tools/prrn/ PRRN/PRRP] uses a [[hill-climbing algorithm]] to optimize its MSA alignment score<ref name="gotoh">{{cite journal | doi = 10.1006/jmbi.1996.0679 | author = Gotoh O | year = 1996 | title = Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments | journal = J Mol Biol | volume = 264 | issue =4 4| pages = 823–38 | pmid = 8980688 }}</ref> and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA.<ref name="mount">Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.</ref> PRRP performs best when refining an alignment previously constructed by a faster method.<ref name="mount"/>
 
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty.<ref name="brudno">{{cite journal | vauthors = Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B | title = Fast and sensitive multiple alignment of large genomic sequences | journal = BMC Bioinformatics | volume = 4 | pages = 66 | date = December 2003 | pmid = 14693042 | pmc = 521198 | doi = 10.1186/1471-2105-4-66 |doi-access=free}}</ref> The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or "''seeds"'' for a slower global-alignment procedure is implemented in the [http://dialign.gobics.de/chaos-dialign-submission CHAOS/DIALIGN] suite.<ref name="brudno"/>
 
A third popular iteration-based method callednamed [[MUSCLE (alignment software)|MUSCLE]] (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences.<ref name="edgar">{{cite journal | doi = 10.1093/nar/gkh340 | author = Edgar RC | year = 2004 | title = MUSCLE: multiple sequence alignment with high accuracy and high throughput | journal = Nucleic Acids Research | volume = 32 | issue =5 5| pages = 1792–97 | pmid=15034147 | pmc=390337}}</ref> The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled).
 
===Consensus methods===
Consensus methods attempt to find the optimal multiple sequence alignment given multiple different alignments of the same set of sequences. There are two commonly used consensus methods, [http://www.tcoffee.org/Projects/mcoffee/ M-COFFEE] and [http://www.stevekellylab.com/software/mergealign MergeAlign].<ref name="mergealign">{{cite journal | doi = 10.1186/1471-2105-13-117 |vauthors=Collingridge PW, Kelly S | year = 2012 | title = MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments | journal = BMC Bioinformatics | volume = 13 | issue = 117 |pages=117 | pmid=22646090 | pmc=3413523 |doi-access=free}}</ref> M-COFFEE uses multiple sequence alignments generated by seven different methods to generate consensus alignments. MergeAlign is capable of generating consensus alignments from any number of input alignments generated using different models of sequence evolution or different methods of multiple sequence alignment. The default option for MergeAlign is to infer a consensus alignment using alignments generated using 91 different models of protein sequence evolution.
 
===Hidden Markov models===
[[File:A profile HMM modelling a multiple sequence alignment.png|thumb|A profile [[hidden Markov model]] (HMM) modelling a multiple sequence alignment]]
A [[Hiddenhidden Markov model]]s are(HMM) is a probabilistic modelsmodel that can assign likelihoods to all possible combinations of gaps, matches, and mismatches, to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.<ref name="mount" />
 
Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, known asnamed the [[Viterbi algorithm]], is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA.<ref name="hughey">{{cite journal |vauthors=Hughey R, Krogh A | year = 1996 | title = Hidden Markov models for sequence analysis: extension and analysis of the basic method | journal = CABIOS | volume = 12 | issue =2 2| pages = 95–107 | pmid = 8744772 | doi=10.1093/bioinformatics/12.2.95 | citeseerx = 10.1.1.44.3365 }}</ref> This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.<ref name="mount" />
[[Hidden Markov model]]s are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.<ref name="mount" />
 
Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is [http://sourceforge.net/projects/poamsa/files/ POA] (Partial-Order Alignment (POA)<!--this download link is temporary, remember to replace when it's fixed-->;<ref name="grasso">{{cite journal | doi = 10.1093/bioinformatics/bth126 |vauthors=Grasso C, Lee C | year = 2004 | title = Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems | journal = Bioinformatics | volume = 20 | issue =10 10| pages = 1546–56 | pmid = 14962922 | doi-access = free }}</ref> and a similar but more generalizedgeneral method is implemented in the packages [http://compbio.soe.ucsc.edu/sam.html SAM] (Sequence Alignment and Modeling System (SAM) software package.<ref name="hugheyT">Hughey R, Krogh A. SAM: Sequence alignment and modeling software system. Technical Report UCSC-CRL-96-22, University of California, Santa Cruz, CA, September 1996.</ref> and [[HMMER]].<ref name="durbin">Durbin R, Eddy S, Krogh A, Mitchison G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.</ref>
Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, known as the [[Viterbi algorithm]], is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA.<ref name="hughey">{{cite journal |vauthors=Hughey R, Krogh A | year = 1996 | title = Hidden Markov models for sequence analysis: extension and analysis of the basic method | journal = CABIOS | volume = 12 | issue = 2| pages = 95–107 | pmid = 8744772 | doi=10.1093/bioinformatics/12.2.95| citeseerx = 10.1.1.44.3365 }}</ref> This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.<ref name="mount" />
SAM has been used as a source of alignments for [[protein structure prediction]] to participate in the Critical Assessment of Structure Prediction ([[CASP]]) structure prediction experiment and to develop a database of predicted proteins in the [[yeast]] species ''[[S. cerevisiae]]''. [[HHpred / HHsearchHH-suite|HHsearch]]<ref>{{ cite journal | author = Söding J | title = Protein homology detection by HMM-HMM comparison| |journal =Bioinformatics Bioinformatics| year =2005 2005| volume =21 21| issue =7 7| pages =951–960 951–960| pmid =15531603 15531603| doi = 10.1093/bioinformatics/bti125 | citeseerx = 10.1.1.519.1257}}</ref> is a software package for the detection of remotely related protein sequences based on the pairwise comparison of HMMs. A server running HHsearch ([[HHpred / HHsearchHH-suite|HHpred]]) was by far the fastest of the 10 best automatic structure prediction servers in the CASP7 and CASP8 structure prediction competitions.<ref>{{ cite journal|vauthors=Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T | title = Automated server predictions in CASP7 | journal = Proteins | year = 2007 | volume = 69 | issue = Suppl 8 | pages = 68–82 | pmid =17894354 17894354| doi = 10.1002/prot.21761 | s2cid =29879391 29879391| doi-access = free }}</ref>
 
Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is [http://sourceforge.net/projects/poamsa/files/ POA] (Partial-Order Alignment)<!--this download link is temporary, remember to replace when it's fixed-->;<ref name="grasso">{{cite journal | doi = 10.1093/bioinformatics/bth126 |vauthors=Grasso C, Lee C | year = 2004 | title = Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems | journal = Bioinformatics | volume = 20 | issue = 10| pages = 1546–56 | pmid = 14962922 | doi-access = free }}</ref> a similar but more generalized method is implemented in the packages [http://compbio.soe.ucsc.edu/sam.html SAM] (Sequence Alignment and Modeling System).<ref name="hugheyT">Hughey R, Krogh A. SAM: Sequence alignment and modeling software system. Technical Report UCSC-CRL-96-22, University of California, Santa Cruz, CA, September 1996.</ref> and [[HMMER]].<ref name="durbin">Durbin R, Eddy S, Krogh A, Mitchison G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.</ref>
SAM has been used as a source of alignments for [[protein structure prediction]] to participate in the [[CASP]] structure prediction experiment and to develop a database of predicted proteins in the [[yeast]] species ''[[S. cerevisiae]]''. [[HHpred / HHsearch|HHsearch]]<ref>{{ cite journal| author = Söding J | title = Protein homology detection by HMM-HMM comparison| journal = Bioinformatics| year = 2005| volume = 21| issue = 7| pages = 951–960| pmid = 15531603| doi = 10.1093/bioinformatics/bti125| citeseerx = 10.1.1.519.1257}}</ref> is a software package for the detection of remotely related protein sequences based on the pairwise comparison of HMMs. A server running HHsearch ([[HHpred / HHsearch|HHpred]]) was by far the fastest of the 10 best automatic structure prediction servers in the CASP7 and CASP8 structure prediction competitions.<ref>{{ cite journal|vauthors=Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T | title = Automated server predictions in CASP7| journal = Proteins | year = 2007 | volume = 69 | issue = Suppl 8 | pages = 68–82 | pmid = 17894354| doi = 10.1002/prot.21761 | s2cid = 29879391| doi-access = free }}</ref>
 
===Phylogeny-aware methods===
[[File:Non-homologous exon alignment.jpg|thumb|450px|Non-homologous exon alignment by an iterative method (a), and by a phylogeny-aware method (b)]]
Most multiple sequence alignment methods try to minimize the number of [[Indel|insertions/deletions]] (gaps) and, as a consequence, produce compact alignments. This causes several problems if the sequences to be aligned contain non-[[Homology (biology)#Sequence homology|homologous]] regions, if gaps are informative in a [[phylogeny]] analysis. These problems are common in newly produced sequences that are poorly annotated and may contain [[Frameshift|frame-shifts]], wrong [[Protein ___domain|domains]] or non-homologous [[RNA splicing|spliced]] [[exonsexon]]s. The first such method was developed in 2005 by Löytynoja and Goldman.<ref name="Loytynoja-2005">{{Cite journal | last1 = Loytynoja | first1 = A. | title = An algorithm for progressive multiple alignment of sequences with insertions | doi = 10.1073/pnas.0409137102 | journal = Proceedings of the National Academy of Sciences | volume = 102 | issue = 30 | pages = 10557–10562 | year = 2005 | pmid =16000407 16000407| pmc =1180752 1180752| bibcode = 2005PNAS..10210557L |doi-access=free}}</ref> The same authors released a software package called ''PRANK'' in 2008.<ref name="Loytynoja-2008">{{cite journal | vauthors = Löytynoja A, Goldman N | title = Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis | journal = Science | volume = 320 | issue = 5883 | pages = 1632–5 | date = June 2008 | pmid = 18566285 | doi = 10.1126/science.1158395 | bibcode = 2008Sci...320.1632L | s2cid = 5211928 }}</ref> PRANK improves alignments when insertions are present. Nevertheless, it runs slowly compared to progressive and/or iterative methods which have been developed for several years.
 
In 2012, two new phylogeny-aware tools appeared. One is called ''PAGAN'' that was developed by the same team as PRANK.<ref name="Loytynoja-2012">{{cite journal | vauthors = Löytynoja A, Vilella AJ, Goldman N | title = Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm | journal = Bioinformatics | volume = 28 | issue = 13 | pages = 1684–91 | date = July 2012 | pmid = 22531217 | pmc = 3381962 | doi = 10.1093/bioinformatics/bts198 }}</ref> The other is ''ProGraphMSA'' developed by Szalkowski.<ref name="Szalkowski-2012">{{cite journal | vauthors = Szalkowski AM | title = Fast and robust multiple sequence alignment with phylogeny-aware gap placement | journal = BMC Bioinformatics | volume = 13 | pages = 129 | date = June 2012 | pmid = 22694311 | pmc = 3495709 | doi = 10.1186/1471-2105-13-129 |doi-access=free}}</ref> Both software packages were developed independently but share common features, notably the use of [[Directed acyclic graph#Partial orders and topological ordering|graph algorithms]] to improve the recognition of non-homologous regions, and an improvement in code making these software faster than PRANK.
 
===Motif finding===
[[File:Caspase-motif-alignment.png|right|thumb|500px|Alignment of the seven [[Drosophila]] [[caspase]]s colored by motifs as identified by MEME. When motif positions and sequence alignments are generated independently, they often correlate well but not perfectly, as in this example.]]
Motif finding, also known as profile analysis, is a method of locating [[sequence motif]]s in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps.<ref name="mount" /> Alternatively, statistical pattern-finding algorithms can identify motifs as a precursor to an MSA rather than as a derivation. In many cases when the query set contains only a small number of sequences or contains only highly related sequences, [[pseudocount]]s are added to normalize the distribution reflected in the scoring matrix. In particular, this corrects zero-probability entries in the matrix to values that are small but nonzero.
 
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families.<ref name="henikoff1991">{{cite journal | vauthors = Henikoff S, Henikoff JG | title = Automated assembly of protein blocks for database searching | journal = Nucleic Acids Res. | volume = 19 | issue = 23 | pages = 6565–72 | date = December 1991 | pmid = 1754394 | pmc = 329220 | doi = 10.1093/nar/19.23.6565 }}</ref> Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. The [https://web.archive.org/web/20130328131920/http://blocks.fhcrc.org/ BLOCKS] server provides an interactive method to locate such motifs in unaligned sequences.
 
Statistical pattern-matching has been implemented using both the [[expectation-maximization algorithm]] and the [[Gibbs sampler]]. One of the most common motif-finding tools, known asnamed [[Multiple EM for Motif Elicitation|MEME]] (MEME), uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion MAST in the combined suite [http://meme.sdsc.edu/meme/intro.html MEME/MAST].<ref name="baileyelkan1994">{{cite book |vauthors=Bailey TL, Elkan C |year=1994 |chapter=Fitting a mixture model by expectation maximization to discover motifs in biopolymers |title=Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology |pages=28–36 |publisher=AAAI Press |___location=Menlo Park, California|chapter-url=http://www.cs.toronto.edu/~brudno/csc2417_15/10.1.1.121.7056.pdf}}</ref><ref name="baileygribskov1998">{{cite journal | vauthors = Bailey TL, Gribskov M | title = Combining evidence using p-values: application to sequence homology searches | journal = Bioinformatics | volume = 14 | issue = 1 | pages = 48–54 | date = 1998 | pmid = 9520501 | doi = 10.1093/bioinformatics/14.1.48 | doi-access = free }}</ref>
 
===Non-coding multiple sequence alignment===
[[Non-coding DNA]] regions, especially [[transcription factor]] binding sites (TFBSs), are rather more conserved, andbut not necessarily evolutionarily related, and may have converged from non-common ancestors. Thus, the assumptions used to align protein sequences and DNA coding regions are inherently different from those that hold for TFBS sequences. Although it is meaningful to align DNA coding regions for homologous sequences using mutation operators, alignment of binding site sequences for the same transcription factor cannot rely on evolutionary related mutation operations. Similarly, the evolutionary operator of point mutations can be used to define an edit distance for coding sequences, but this has little meaning for TFBS sequences because any sequence variation has to maintain a certain level of specificity for the binding site to function. This becomes specifically important when trying to align known TFBS sequences to build supervised models to predict unknown locations of the same TFBS. Hence, Multiple Sequence Alignment methods need to adjust the underlying evolutionary hypothesis and the operators used as in the work published incorporating neighbouring base thermodynamic information <ref name=Salama2013>{{cite journal | vauthors = Salama RA, Stekel DJ | title = A non-independent energy-based multiple sequence alignment improves prediction of transcription factor binding sites | journal = Bioinformatics | volume = 29 | issue = 21 | pages = 2699–704 | date = November 2013 | pmid = 23990411 | doi = 10.1093/bioinformatics/btt463 | doi-access = free }}</ref> to align the binding sites searching for the lowest thermodynamic alignment conserving specificity of the binding site, [http://sourceforge.net/projects/msa-edna/ EDNA] .
 
==Optimization==
 
===Genetic algorithms and simulated annealing===
Standard optimization techniques in computer science&nbsp;— both of which were inspired by, but do not directly reproduce, physical processes&nbsp;— have also been used in an attempt to more efficiently produce quality MSAs. One such technique, [[genetic algorithm]]s, has been used for MSA production in an attempt to broadly simulate the hypothesized evolutionary process that gave rise to the divergence in the query set. The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at varying positions. A general [[objective function]] is optimized during the simulation, most generally the "sum of pairs" maximization function introduced in dynamic programming-based MSA methods. A technique for protein sequences has been implemented in the software program SAGA (Sequence Alignment by Genetic Algorithm)<ref name="notredame2">{{cite journal | vauthors = Notredame C, Higgins DG | title = SAGA: sequence alignment by genetic algorithm | journal = Nucleic Acids Res. | volume = 24 | issue = 8 | pages = 1515–24 | date = April 1996 | pmid = 8628686 | pmc = 145823 | doi = 10.1093/nar/24.8.1515}}</ref> and its equivalent in RNA is called RAGA.<ref name="notredame3">{{cite journal | doi = 10.1093/nar/25.22.4570 |vauthors=Notredame C, O'Brien EA, Higgins DG | year = 1997 | title = RAGA: RNA sequence alignment by genetic algorithm | journal = Nucleic Acids Res | volume = 25 | issue =22 22| pages = 4570–80 | pmid = 9358168 | pmc = 147093 }}</ref>
 
The technique of [[simulated annealing]], by which an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).<ref name="kim">{{cite journal |vauthors=Kim J, Pramanik S, Chung MJ | year = 1994 | title = Multiple sequence alignment using simulated annealing | journal = Comput Appl Biosci | volume = 10 | issue = 4| pages = 419–26 | pmid = 7804875 | doi=10.1093/bioinformatics/10.4.419}}</ref>
 
The technique of [[simulated annealing]], by which an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).<ref name="kim">{{cite journal |vauthors=Kim J, Pramanik S, Chung MJ | year = 1994 | title = Multiple sequence alignment using simulated annealing | journal =Computer ComputApplications Applin Bioscithe Biosciences | volume = 10 | issue =4 4| pages = 419–26 | pmid = 7804875 | doi=10.1093/bioinformatics/10.4.419}}</ref>
=== Mathematical programming and exact solution algorithms ===
 
=== Mathematical programming and exact solution algorithms ===
[[Mathematical optimization|Mathematical programming]] and in particular [[Mixedmixed integer programming]] models are another approach to solve MSA problems. The advantage of such optimization models is that they can be used to find the optimal MSA solution more efficiently compared to the traditional DP approach. This is due in part, to the applicability of decomposition techniques for mathematical programs, where the MSA model is decomposed into smaller parts and iteratively solved until the optimal solution is found. Example algorithms used to solve mixed integer programming models of MSA include [[branch and price]] <ref name="althaus2006">{{cite journal | doi = 10.1007/s10107-005-0659-3 |vauthors=Althaus E, Caprara A, Lenhof HP, Reinert K | year = 2006 | title = A branch-and-cut algorithm for multiple sequence alignment | journal = Mathematical Programming | volume = 105 | issue =2–3 2–3| pages = 387–425 |s2cid=17715172 }}</ref> and [[Benders decomposition]].<ref name="hosseininasab"/> Although exact approaches are computationally slow compared to heuristic algorithms for MSA, they are guaranteed to reach the optimal solution eventually, even for large-size problems.
 
===Simulated quantum computing===
In January 2017, [[D-Wave Systems]] announced that its qbsolv open-source [[quantum computing]] software had been successfully used to find a faster solution to the MSA problem.<ref>[{{Cite web |url=http://www.dwavesys.com/press-releases/d-wave-initiates-open-quantum-software-environment |title=D-Wave Initiates Open Quantum Software Environment 11 January 2017] |access-date=20 January 2017 |archive-date=8 March 2021 |archive-url=https://web.archive.org/web/20210308155608/https://www.dwavesys.com/press-releases/d-wave-initiates-open-quantum-software-environment |url-status=dead}}</ref>
 
==Alignment visualization and quality control==
The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. For example, an evaluation of several leading alignment programs using the [[List of sequence alignment software#Benchmarking|BAliBase benchmark]] found that at least 24% of all pairs of aligned amino acids were incorrectly aligned.<ref name="nuin2006">{{cite journal |vauthors=Nuin PA, Wang Z, Tillier ER |year=2006 |title=The accuracy of several multiple sequence alignment programs for proteins |journal=BMC Bioinformatics |doi=10.1186/1471-2105-7-471 |pmid=17062146 |volume=7 |pmc=1633746 |pages=471 |doi-access=free}}</ref> These errors can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. As the number of sequence and their divergence increases many more errors will be made simply because of the heuristic nature of MSA algorithms. [[List of alignment visualization software|Multiple sequence alignment viewers]] enable alignments to be visually reviewed, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.<ref>{{cite web | title=Manual editing and adjustment of MSAs | publisher=European Molecular Biology Laboratory | year=2007 | url=http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html | access-date=March 7, 2010 | archive-url=https://web.archive.org/web/20150924000135/http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html | archive-date=September 24, 2015 | url-status=dead }}</ref>
 
However, as the number of sequences increases and especially in genome-wide studies that involve many MSAs it is impossible to manually curate all alignments. Furthermore, manual curation is subjective. And finally, even the best expert cannot confidently align the more ambiguous cases of highly diverged sequences. In such cases it is common practice to use automatic procedures to exclude unreliably aligned regions from the MSA. For the purpose of phylogeny reconstruction (see below) the Gblocks program is widely used to remove alignment blocks suspect of low quality, according to various cutoffs on the number of gapped sequences in alignment columns.<ref name="castresana2000">{{cite journal | vauthors = Castresana J | title = Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis | journal =Molecular Mol.Biology Biol.and Evol.Evolution | volume = 17 | issue = 4 | pages = 540–52 | date = April 2000 | pmid = 10742046 | doi = 10.1093/oxfordjournals.molbev.a026334 | doi-access = free }}</ref> However, these criteria may excessively filter out regions with insertion/deletion events that may still be aligned reliably, and these regions might be desirable for other purposes such as detection of positive selection. A few alignment algorithms output site-specific scores that allow the selection of high-confidence regions. Such a service was first offered by the SOAP program,<ref name="loytynojaMilinkovitch2001">{{cite journal | vauthors = Löytynoja A, Milinkovitch MC | title = SOAP, cleaning multiple alignments from unstable blocks | journal = Bioinformatics | volume = 17 | issue = 6 | pages = 573–4 | date = June 2001 | pmid = 11395440 | doi = 10.1093/bioinformatics/17.6.573 | doi-access = free }}</ref> which tests the robustness of each column to perturbation in the parameters of the popular alignment program CLUSTALW. The T-Coffee program<ref name=poirotOTooleNotredame2003>{{cite journal | vauthors = Poirot O, O'Toole E, Notredame C | title = Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments | journal = Nucleic Acids Res. | volume = 31 | issue = 13 | pages = 3503–6 | date = July 2003 | pmid = 12824354 | pmc = 168929 | doi = 10.1093/nar/gkg522 }}</ref> uses a library of alignments in the construction of the final MSA, and its output MSA is colored according to confidence scores that reflect the agreement between different alignments in the library regarding each aligned residue. Its extension, [http://tcoffee.crg.cat/tcsTransitive TCS]Consistency :Score ('''T'''ransitive '''C'''onsistency '''S'''coreTCS), uses T-Coffee [[Library (computing)|libraries]] of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy.<ref name=TCS2014MBE>{{cite journal|last=Chang|first=JM|author2=Di Tommaso, P | author3 = Notredame, C|title=TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction.|journal=Molecular Biology and Evolution|date=JunJune 2014|volume=31|issue=6|pages=1625–37|doi=10.1093/molbev/msu117|pmid=24694831|doi-access=free}}</ref><ref name=TCS_2015_NAR>{{cite journal | vauthors = Chang JM, Di Tommaso P, Lefort V, Gascuel O, Notredame C | title = TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction | journal = Nucleic Acids Res. | volume = 43 | issue = W1 | pages = W3–6 | date = July 2015 | pmid = 25855806 | pmc = 4489230 | doi = 10.1093/nar/gkv310 }}</ref> Another alignment program that can output an MSA with confidence scores is FSA,<ref name=bradley2009>{{cite journal | vauthors = Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L | title = Fast statistical alignment | journal = PLOS Comput. Biol. | volume = 5 | issue = 5 | pages = e1000392 | date = May 2009 | pmid = 19478997 | pmc = 2684580 | doi = 10.1371/journal.pcbi.1000392 | bibcode = 2009PLSCB...5E0392B |doi-access=free}}</ref> which uses a [[statistical model]] that allows calculation of the uncertainty in the alignment. The HoT (Heads-Or-Tails) score can be used as a measure of site-specific alignment uncertainty due to the existence of multiple co-optimal solutions.<ref name=landanGraur2008>{{cite book | vauthors = Landan G, Graur D | title = Biocomputing 2008 | chapter = Local reliability measures from sets of co-optimal multiple sequence alignments | journal = Pac Symp Biocomput | pages = 15–24 | date = 2008 | pmid = 18229673 | doi = 10.1142/9789812776136_0003 |isbn = 978-981-277-608-2 }}</ref> The GUIDANCE program<ref name="penn2010">{{cite journal | vauthors = Penn O, Privman E, Landan G, Graur D, Pupko T | title = An alignment confidence score capturing robustness to guide tree uncertainty | journal =Molecular Mol.Biology Biol.and Evol.Evolution | volume = 27 | issue = 8 | pages = 1759–67 | date = August 2010 | pmid = 20207713 | pmc = 2908709 | doi = 10.1093/molbev/msq066 }}</ref> calculates a similar site-specific confidence measure based on the robustness of the alignment to uncertainty in the guide tree that is used in progressive alignment programs. An alternative, more statistically justified approach to assess alignment uncertainty is the use of probabilistic evolutionary models for joint estimation of phylogeny and alignment. A Bayesian approach allows calculation of posterior probabilities of estimated phylogeny and alignment, which is a measure of the confidence in these estimates. In this case, a [[posterior probability]] can be calculated for each site in the alignment. Such an approach was implemented in the program BAli-Phy.<ref name="redelingsSuchard2005">{{cite journal | vauthors = Redelings BD, Suchard MA | title = Joint Bayesian estimation of alignment and phylogeny | journal = Syst. Biol. | volume = 54 | issue = 3 | pages = 401–18 | date = June 2005 | pmid = 16012107 | doi = 10.1080/10635150590947041 | doi-access = free }}</ref>
 
There are free programs available for visualization of multiple sequence alignments, for example [[Jalview]] and [[UGENE]].
 
==Phylogenetic use==
Multiple sequence alignments can be used to create a [[phylogenetic tree]].<ref name="Budd_2009">{{Cite web | title=Multiple sequence alignment exercises and demonstrations | last=Budd | first=Aidan | publisher=European Molecular Biology Laboratory | date=10 February 2009 | access-date=June 30, 2010 | url=http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html | archive-url=https://web.archive.org/web/20120305164700/http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html | archive-date=5 March 2012 | url-status=dead }}</ref> This is made possible by two reasons. The first is because functional domains that are known in annotated sequences can be used for alignment in non-annotated sequences. The other is that conserved regions known to be functionally important can be found. This makes it possible for multiple sequence alignments to be used to analyze and find evolutionary relationships through homology between sequences. Point mutations and insertion or deletion events (called indels) can be detected.
 
Multiple sequence alignments can also be used to identify functionally important sites, such as binding sites, active sites, or sites corresponding to other key functions, by locating conserved domains. When looking at multiple sequence alignments, it is useful to consider different aspects of the sequences when comparing sequences. These aspects include identity, similarity, and homology. Identity means that the sequences have identical residues at their respective positions. On the other hand, similarity has to do with the sequences being compared having similar residues quantitatively. For example, in terms of nucleotide sequences, pyrimidines are considered similar to each other, as are purines. Similarity ultimately leads to homology, in that the more similar sequences are, the closer they are to being homologous. This similarity in sequences can then go on to help find common ancestry.<ref name="Budd_2009"/>
 
==See also==
*[[Alignment-free sequence analysis]]
*[[Cladistics]]
*[[Generalized tree alignment]]
*[[List of alignment visualization software|Multiple sequence alignment viewers]]
*[[PANDIT (database)|PANDIT]], a biological database covering protein domains
*[[Phylogenetics]]
*[[Sequence alignment software]]
*[[List of alignment visualization software|Multiple sequence alignment viewers]]
*[[Structural alignment]]
*[[Alignment-free sequence analysis]]
 
==References==
Line 143 ⟶ 138:
 
===Survey articles===
*{{Cite book | first = L. | last = Duret |author2=S. Abdeddaim | year = 2000 | chapter = Multiple alignment for structural functional or phylogenetic analyses of homologous sequences | editor = D. Higgins and W. Taylor | title = Bioinformatics sequence structure and databanks | ___location = Oxford | publisher = Oxford University Press}}
*{{Cite journal | first = C. | last = Notredame | year = 2002 | title = Recent progresses in multiple sequence alignment: a survey | journal = Pharmacogenomics | volume =3 3| issue = 1 | pages = 131–144 | doi = 10.1517/14622416.3.1.131 | pmid = 11966409 }}
*{{Cite journal | first1 = J. D. | last1 = Thompson | first2 = F. | last2 = Plewniak | first3 = O. | last3 = Poch | year = 1999 | title = A comprehensive comparison of multiple sequence alignment programs | journal = Nucleic Acids Research | volume = 27 | issue = 13 | pages= 12682–2690 | doi = 10.1093/nar/27.13.2682 | pmid = 10373585 | pmc = 148477 }}
*{{Cite journal | first1 = I.M. | last1 = Wallace | first2 = G. | last2 = Blackshields | first3 = D.G. | last3 = Higgins | year = 2005 | title = Multiple sequence alignments | journal = Curr Opin Struct Biol | volume = 15 | issue = 3 | pages = 261–266 | doi = 10.1016/j.sbi.2005.04.002 | pmid = 15963889}}
*{{Cite journal | first = C | last = Notredame | year = 2007 | title = Recent Evolutions of Multiple Sequence Alignment Algorithms | journal = PLOS Computational Biology | volume = 3 | issue = 8 | page = e123 | doi = 10.1371/journal.pcbi.0030123 | pmid = 17784778 | pmc =1963500 1963500| bibcode = 2007PLSCB...3..123N |doi-access=free}}
 
==External links==
* [http://www.expasy.org/tools/#align ExPASy sequence alignment tools]
* [https://web.archive.org/web/20100107231637/http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html Archived Multiple Alignment Resource Page]&nbsp;— from the Virtual School of Natural Sciences
* [https://web.archive.org/web/20060909003117/http://pbil.univ-lyon1.fr/alignment.html Tools for Multiple Alignments]&nbsp;— from Pôle Bioinformatique Lyonnais
* [http://www.clustal.org/ An entry point to clustal servers and information]
* [http://www.tcoffee.org/ An entry point to the main T-Coffee servers]
* [http://www.mergealign.com/ An entry point to the main MergeAlign server and information]
*European Bioinformatics Institute servers:
** [http://www.ebi.ac.uk/Tools/clustalw2/ ClustalW2]&nbsp;— general purpose multiple sequence alignment program for DNA or proteins.
** [http://www.ebi.ac.uk/Tools/muscle/ Muscle]&nbsp;— MUltiple Sequence Comparison by Log-Expectation
** [http://www.ebi.ac.uk/Tools/t-coffee/ T-coffee]&nbsp;— multiple sequence alignment.
** [http://www.ebi.ac.uk/Tools/mafft/ MAFFT]&nbsp;— Multiple Alignment using Fast Fourier Transform
** [http://www.ebi.ac.uk/Tools/kalign/ KALIGN]&nbsp;— a fast and accurate multiple sequence alignment algorithm.
 
===Lecture notes, tutorials, and courses===
* [http://cmb.molgen.mpg.de/ Multiple sequence alignment lectures]&nbsp;— from the Max Planck Institute for Molecular Genetics
* [https://web.archive.org/web/20120305164700/http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html Lecture Notes and practical exercises] on multiple sequence alignments at the [[European Molecular Biology Laboratory|EMBL]] (EMBL)
* [https://web.archive.org/web/20110904184621/http://www.avatar.se/molbioinfo2001/index.html Molecular Bioinformatics Lecture Notes]
* [https://web.archive.org/web/20090821192233/http://bioinf.may.ie/school02/notes.html Molecular Evolution and Bioinformatics Lecture Notes]
 
{{Good article}}
 
{{DEFAULTSORT:Multiple Sequence Alignment}}
Line 175 ⟶ 168:
[[Category:Computational phylogenetics]]
[[Category:Markov models]]
[[Category:NP-complete problems]]