Multiple sequence alignment: Difference between revisions

Content deleted Content added
m Reverted edits by 186.5.236.226 (talk) (HG) (3.4.10)
Citation bot (talk | contribs)
Add: doi-access. | Use this bot. Report bugs. | Suggested by Headbomb | Linked from Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Sandbox | #UCB_webform_linked 55/690
Line 3:
'''Multiple sequence alignment''' ('''MSA''') may refer to the process or the result of [[sequence alignment]] of three or more [[biological sequence]]s, generally [[protein]], [[DNA]], or [[RNA]]. In many cases, the input set of query sequences are assumed to have an [[evolutionary]] relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence [[homology (biology)|homology]] can be inferred and [[molecular phylogeny|phylogenetic analysis]] can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate [[mutation]] events such as point mutations (single [[amino acid]] or [[nucleotide]] changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations ([[indel]]s or gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence [[conservation (genetics)|conservation]] of [[protein ___domain]]s, [[tertiary structure|tertiary]] and [[secondary structure|secondary]] structures, and even individual amino acids or nucleotides.
 
Computational [[algorithm]]s are used to produce and analyse the MSAs due to the difficulty and intractability of manually processing the sequences given their biologically-relevant length. MSAs require more sophisticated methodologies than [[sequence alignment|pairwise alignment]] because they are more [[Computational complexity theory|computationally complex]]. Most multiple sequence alignment programs use [[heuristic]] methods rather than [[global optimization]] because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. On the other hand, heuristic methods generally fail to give guarantees on the solution quality, with heuristic solutions shown to be often far below the optimal solution on benchmark instances.<ref name="thompson2011">{{cite journal | doi = 10.1371/journal.pone.0018093|vauthors= Thompson JD, Linard B, Lecompte O, Poch O | year = 2011 | title = A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives | journal = PLOS ONE | volume = 6 | issue = 3| pages = e18093| pmid = 21483869| pmc = 3069049|bibcode= 2011PLoSO...618093T |doi-access= free }}</ref><ref name="nuin2006" /><ref name="hosseininasab">{{cite journal | doi = 10.1287/ijoc.2019.0937 |vauthors=Hosseininasab A, van Hoeve WJ | year = 2019 | title = Exact Multiple Sequence Alignment by Synchronized Decision Diagrams | journal = INFORMS Journal on Computing }}</ref>
 
==Problem statement==
Line 52:
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web|title=Help with matrices used in sequence comparison tools|url=http://www.ebi.ac.uk/help/matrix.html|url-status=dead|archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html|archive-date=March 11, 2010|access-date=March 3, 2010|publisher=European Bioinformatics Institute}}</ref>
 
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP complete|NP-complete]] problem.<ref name="wang">{{cite journal|vauthors=Wang L, Jiang T|year=1994|title=On the complexity of multiple sequence alignment|journal=J Comput Biol|volume=1|issue=4|pages=337–348|citeseerx=10.1.1.408.894|doi=10.1089/cmb.1994.1.337|pmid=8790475}}</ref><ref name="just">{{cite journal|author=Just W|year=2001|title=Computational complexity of multiple sequence alignment with SP-score|journal=J Comput Biol|volume=8|issue=6|pages=615–23|citeseerx=10.1.1.31.6382|doi=10.1089/106652701753307511|pmid=11747615}}</ref><ref name="elias">{{cite journal|author=Elias, Isaac|year=2006|title=Settling the intractability of multiple alignment|journal=J Comput Biol|volume=13|issue=7|pages=1323–1339|citeseerx=10.1.1.6.256|doi=10.1089/cmb.2006.13.1323|pmid=17037961}}</ref> In 1989, based on Carrillo-Lipman Algorithm,<ref name="carrillo">{{cite journal|vauthors=Carrillo H, Lipman DJ|year=1988|title=The Multiple Sequence Alignment Problem in Biology|url=https://zenodo.org/record/1236134|journal=SIAM Journal on Applied Mathematics|volume=48|issue=5|pages=1073–1082|doi=10.1137/0148063}}</ref> Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.<ref name="altschul">{{cite journal|vauthors=Lipman DJ, Altschul SF, Kececioglu JD|year=1989|title=A tool for multiple sequence alignment|journal=Proc Natl Acad Sci U S A|volume=86|issue=12|pages=4412–4415|bibcode=1989PNAS...86.4412L|doi=10.1073/pnas.86.12.4412|pmc=287279|pmid=2734293|doi-access=free}}</ref> In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called ''sum of pair'' score) and has been implemented in a software program for constructing multiple sequence alignments.<ref>{{cite web|title=Genetic analysis software|url=https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html|access-date=March 3, 2010|publisher=National Center for Biotechnology Information}}</ref> In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity.<ref name="hosseininasab" />
 
=== Progressive alignment construction ===
Line 89:
===Phylogeny-aware methods===
[[File:Non-homologous exon alignment.jpg|thumb|450px|Non-homologous exon alignment by an iterative method (a), and by a phylogeny-aware method (b)]]
Most multiple sequence alignment methods try to minimize the number of [[Indel|insertions/deletions]] (gaps) and, as a consequence, produce compact alignments. This causes several problems if the sequences to be aligned contain non-[[Homology (biology)#Sequence homology|homologous]] regions, if gaps are informative in a [[phylogeny]] analysis. These problems are common in newly produced sequences that are poorly annotated and may contain [[Frameshift|frame-shifts]], wrong [[Protein ___domain|domains]] or non-homologous [[RNA splicing|spliced]] [[exons]]. The first such method was developed in 2005 by Löytynoja and Goldman.<ref name="Loytynoja-2005">{{Cite journal | last1 = Loytynoja | first1 = A. | title = An algorithm for progressive multiple alignment of sequences with insertions | doi = 10.1073/pnas.0409137102 | journal = Proceedings of the National Academy of Sciences | volume = 102 | issue = 30 | pages = 10557–10562 | year = 2005 | pmid = 16000407| pmc = 1180752| bibcode = 2005PNAS..10210557L | doi-access = free }}</ref> The same authors released a software package called ''PRANK'' in 2008.<ref name="Loytynoja-2008">{{cite journal | vauthors = Löytynoja A, Goldman N | title = Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis | journal = Science | volume = 320 | issue = 5883 | pages = 1632–5 | date = June 2008 | pmid = 18566285 | doi = 10.1126/science.1158395 | bibcode = 2008Sci...320.1632L | s2cid = 5211928 }}</ref> PRANK improves alignments when insertions are present. Nevertheless, it runs slowly compared to progressive and/or iterative methods which have been developed for several years.
 
In 2012, two new phylogeny-aware tools appeared. One is called ''PAGAN'' that was developed by the same team as PRANK.<ref name="Loytynoja-2012">{{cite journal | vauthors = Löytynoja A, Vilella AJ, Goldman N | title = Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm | journal = Bioinformatics | volume = 28 | issue = 13 | pages = 1684–91 | date = July 2012 | pmid = 22531217 | pmc = 3381962 | doi = 10.1093/bioinformatics/bts198 }}</ref> The other is ''ProGraphMSA'' developed by Szalkowski.<ref name="Szalkowski-2012">{{cite journal | vauthors = Szalkowski AM | title = Fast and robust multiple sequence alignment with phylogeny-aware gap placement | journal = BMC Bioinformatics | volume = 13 | pages = 129 | date = June 2012 | pmid = 22694311 | pmc = 3495709 | doi = 10.1186/1471-2105-13-129 }}</ref> Both software packages were developed independently but share common features, notably the use of [[Directed acyclic graph#Partial orders and topological ordering|graph algorithms]] to improve the recognition of non-homologous regions, and an improvement in code making these software faster than PRANK.