Multiple sequence alignment: Difference between revisions

Content deleted Content added
top: Renaming Algorithm -> Problem statement. Deleting duplicating explanations from the intro.
top: Moving the dynamic programming approach to the list of methods. Removing "Application in code" section which does not refer to applications or code.
Line 45:
When determining the best suited alignments for each MSA, a ''trace'' is usually generated. A trace is a set of ''realized'', or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences.
 
==ApplicationAlignment in codemethods==
 
There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
===Dynamic programming and computational complexity===
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web | title=Help with matrices used in sequence comparison tools | publisher=European Bioinformatics Institute | url=http://www.ebi.ac.uk/help/matrix.html | accessdate=March 3, 2010 | url-status=dead | archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html | archive-date=March 11, 2010 }}</ref>
 
=== Dynamic programming and computational complexity===
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP complete|NP-complete]] problem.<ref name="wang">{{cite journal | doi = 10.1089/cmb.1994.1.337 |vauthors=Wang L, Jiang T | year = 1994 | title = On the complexity of multiple sequence alignment | journal = J Comput Biol | volume = 1 | issue = 4| pages = 337–348 | pmid = 8790475 |citeseerx=10.1.1.408.894 }}</ref><ref name="just">{{cite journal | doi = 10.1089/106652701753307511 | author = Just W | year = 2001 | title = Computational complexity of multiple sequence alignment with SP-score | journal = J Comput Biol | volume = 8 | issue = 6| pages = 615–23 | pmid = 11747615 | citeseerx = 10.1.1.31.6382 }}</ref><ref name=elias>{{cite journal | journal=J Comput Biol | volume=13 | pages=1323–1339 | year=2006 | author=Elias, Isaac | title=Settling the intractability of multiple alignment | pmid=17037961 | doi =10.1089/cmb.2006.13.1323 | issue=7 | citeseerx=10.1.1.6.256 }}</ref> In 1989, based on Carrillo-Lipman Algorithm,<ref name="carrillo">{{cite journal |vauthors=Carrillo H, Lipman DJ | year = 1988 | title = The Multiple Sequence Alignment Problem in Biology | url = https://zenodo.org/record/1236134| journal = SIAM Journal on Applied Mathematics | volume = 48 | issue = 5| pages = 1073–1082 | doi=10.1137/0148063}}</ref> Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.<ref name="altschul">{{cite journal | doi = 10.1073/pnas.86.12.4412 |vauthors=Lipman DJ, Altschul SF, Kececioglu JD | year = 1989 | title = A tool for multiple sequence alignment | journal = Proc Natl Acad Sci U S A | volume = 86 | issue = 12| pages = 4412–4415 | pmid = 2734293 | pmc = 287279 |bibcode=1989PNAS...86.4412L }}</ref> In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called ''sum of pair'' score) and has been implemented in a software program for constructing multiple sequence alignments.<ref>{{cite web | title=Genetic analysis software | publisher=National Center for Biotechnology Information | url=https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html | accessdate=March 3, 2010}}</ref> In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity.<ref name="hosseininasab"/>
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web | title=Help with matrices used in sequence comparison tools | publisher=European Bioinformatics Institute | url=http://www.ebi.ac.uk/help/matrix.html | accessdate=March 3, 2010 | url-status=dead | archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html | archive-date=March 11, 2010|accessdate=March 3, 2010|publisher=European Bioinformatics Institute}}</ref>
 
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP complete|NP-complete]] problem.<ref name="wang">{{cite journal | doi = 10.1089/cmb.1994.1.337 |vauthors=Wang L, Jiang T | year = 1994 | title = On the complexity of multiple sequence alignment | journal = J Comput Biol | volume = 1 | issue = 4| pages = 337–348 | pmid = 8790475 |citeseerx=10.1.1.408.894 |doi=10.1089/cmb.1994.1.337|pmid=8790475}}</ref><ref name="just">{{cite journal | doi = 10.1089/106652701753307511 | author = Just W | year = 2001 | title = Computational complexity of multiple sequence alignment with SP-score | journal = J Comput Biol | volume = 8 | issue = 6| pages = 615–23 | pmid = 11747615 | citeseerx = 10.1.1.31.6382 |doi=10.1089/106652701753307511|pmid=11747615}}</ref><ref name="elias">{{cite journal | journalauthor=J Comput BiolElias, Isaac| volume=13 | pages=1323–1339 | year=2006 | author=Elias, Isaac | title=Settling the intractability of multiple alignment | pmidjournal=17037961J Comput Biol| doi volume=10.1089/cmb.2006.13.1323 | issue=7 | pages=1323–1339|citeseerx=10.1.1.6.256 |doi=10.1089/cmb.2006.13.1323|pmid=17037961}}</ref> In 1989, based on Carrillo-Lipman Algorithm,<ref name="carrillo">{{cite journal |vauthors=Carrillo H, Lipman DJ | year = 1988 | title = The Multiple Sequence Alignment Problem in Biology | url = https://zenodo.org/record/1236134| journal = SIAM Journal on Applied Mathematics | volume = 48 | issue = 5| pages = 1073–1082 | doi=10.1137/0148063}}</ref> Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.<ref name="altschul">{{cite journal | doi = 10.1073/pnas.86.12.4412 |vauthors=Lipman DJ, Altschul SF, Kececioglu JD | year = 1989 | title = A tool for multiple sequence alignment | journal = Proc Natl Acad Sci U S A | volume = 86 | issue = 12| pages = 4412–4415 | pmid = 2734293 | pmc = 287279 |bibcode=1989PNAS...86.4412L |doi=10.1073/pnas.86.12.4412|pmc=287279|pmid=2734293}}</ref> In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called ''sum of pair'' score) and has been implemented in a software program for constructing multiple sequence alignments.<ref>{{cite web | title=Genetic analysis software | publisher=National Center for Biotechnology Information | url=https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html | accessdate=March 3, 2010|publisher=National Center for Biotechnology Information}}</ref> In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity.<ref name="hosseininasab" />
==Alignment methods==
 
There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
 
=== Progressive alignment construction ===
The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method) developed by Da-Fei Feng and Doolittle in 1987.<ref name="feng1987progressive">{{cite journal |vauthors=Feng DF, Doolittle RF |year=1987 |title=Progressive sequence alignment as a prerequisitetto correct phylogenetic trees |journal=J Mol Evol |volume=25 |issue=4 |pages=351–360 |pmid=3118049|doi=10.1007/BF02603120 |bibcode=1987JMolE..25..351F |s2cid=6345432 }}</ref> Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a [[phylogenetic tree|tree]], called a ''guide tree'', and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial ''guide tree'' is determined by an efficient [[Cluster analysis|clustering]] method such as [[neighbor-joining]] or [[UPGMA]], and may use distances based on the number of identical two-letter sub-sequences (as in [[FASTA]] rather than a dynamic programming alignment).<ref name="mount"/>