Multiple sequence alignment: Difference between revisions

WP:LINKs: adds, standardizes, needless: underscores > spaces, WP:PIPEs > WP:NOPIPEs. WP:REFerence WP:CITation parameters: cut whitespace characters to standardize, aid work via small screens, update-standardize. Nonacronym proper noun MOS:ALLCAPS > WP:LOWERCASE sentence case. WP:COPYEDITs WP:EoS: clarify, WP:TERSE. MOS:FIRSTABBReviations clarify, define before WP:ABBRs in parentheses. Spaced: MDASHs > NDASHs.
m Cut needless carriage return whitespace characters in sections: to standardize, aid work via small screens. MOS:FIRSTABBReviations clarify, define before WP:ABBRs in parentheses.
Line 6:
 
==Problem statement==
 
Given <math>m</math> sequences <math>S_i</math>, <math>i = 1,\cdots,m</math> similar to the form below:
 
Line 42 ⟶ 41:
 
===Tracing alignments===
 
When determining the best suited alignments for each MSA, a ''trace'' is usually generated. A trace is a set of ''realized'', or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences.
 
==Alignment methods==
 
There are various alignment methods used within multiple sequence alignment to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
 
Line 68 ⟶ 65:
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result – that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general [[objective function]] such as finding a high-quality alignment score.<ref name="mount"/>
 
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique.<ref name="hirosawa">{{cite journal |vauthors=Hirosawa M, Totoki Y, Hoshida M, Ishikawa M |year=1995 |title=Comprehensive study on iterative algorithms of multiple sequence alignment |journal=Computer Applications in the Biosciences |volume=11 |issue=1 |pages=13–18 |pmid=7796270 |doi=10.1093/bioinformatics/11.1.13}}</ref> The software package PRRN/PRRP uses a [[hill-climbing algorithm]] to optimize its MSA alignment score<ref name="gotoh">{{cite journal |doi=10.1006/jmbi.1996.0679 |author=Gotoh O |year=1996 |title=Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments |journal=J Mol Biol |volume=264 |issue=4 |pages=823–38 |pmid=8980688}}</ref> and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA.<ref name="mount">Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.</ref> PRRP performs best when refining an alignment previously constructed by a faster method.<ref name="mount"/>
 
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty.<ref name="brudno">{{cite journal |vauthors=Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B |title=Fast and sensitive multiple alignment of large genomic sequences |journal=BMC Bioinformatics |volume=4 |pages=66 |date=December 2003 |pmid=14693042 |pmc=521198 |doi=10.1186/1471-2105-4-66 |doi-access=free}}</ref> The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or ''seeds'' for a slower global-alignment procedure is implemented in the CHAOS/DIALIGN suite.<ref name="brudno"/>
Line 78 ⟶ 75:
 
===Hidden Markov models===
[[File:A profile HMM modelling a multiple sequence alignment.png|thumb|A profile [[hidden Markov model]] (HMM) modelling a multiple sequence alignment]]
A [[hidden Markov model]] (HMM) is a probabilistic model that can assign likelihoods to all possible combinations of gaps, matches, and mismatches, to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.<ref name="mount"/>
 
Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, named the [[Viterbi algorithm]], is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA.<ref name="hughey">{{cite journal |vauthors=Hughey R, Krogh A |year=1996 |title=Hidden Markov models for sequence analysis: extension and analysis of the basic method |journal=CABIOS |volume=12 |issue=2 |pages=95–107 |pmid=8744772 |doi=10.1093/bioinformatics/12.2.95 |citeseerx=10.1.1.44.3365}}</ref> This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.<ref name="mount"/>
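The Viterbi step mentioned above can be illustrated with a minimal sketch in Python. It is a generic Viterbi decoder for a small discrete HMM, not a profile HMM taken from any MSA program; the two state names (<code>match</code>, <code>insert</code>) and every probability below are assumptions invented for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most likely hidden path."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model: all numbers are illustrative assumptions.
states = ("match", "insert")
start_p = {"match": 0.9, "insert": 0.1}
trans_p = {"match": {"match": 0.8, "insert": 0.2},
           "insert": {"match": 0.5, "insert": 0.5}}
emit_p = {"match": {"A": 0.95, "G": 0.05},
          "insert": {"A": 0.5, "G": 0.5}}

log_prob, path = viterbi("AAGA", states, start_p, trans_p, emit_p)
print(path)  # ['match', 'match', 'insert', 'match']
```

Working in log space avoids numerical underflow on long sequences; the dynamic programming table has one entry per state per position, so the cost grows linearly with sequence length.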
Line 105 ⟶ 101:
 
==Optimization==
 
===Genetic algorithms and simulated annealing===
Standard optimization techniques in computer science – both of which were inspired by, but do not directly reproduce, physical processes – have also been used in an attempt to more efficiently produce quality MSAs. One such technique, [[genetic algorithm]]s, has been used for MSA production in an attempt to broadly simulate the hypothesized evolutionary process that gave rise to the divergence in the query set. The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at varying positions. A general [[objective function]] is optimized during the simulation, most generally the "sum of pairs" maximization function introduced in dynamic programming-based MSA methods. A technique for protein sequences has been implemented in the software program SAGA (Sequence Alignment by Genetic Algorithm)<ref name="notredame2">{{cite journal |vauthors=Notredame C, Higgins DG |title=SAGA: sequence alignment by genetic algorithm |journal=Nucleic Acids Res. |volume=24 |issue=8 |pages=1515–24 |date=April 1996 |pmid=8628686 |pmc=145823 |doi=10.1093/nar/24.8.1515}}</ref> and its equivalent in RNA is called RAGA.<ref name="notredame3">{{cite journal |doi=10.1093/nar/25.22.4570 |vauthors=Notredame C, O'Brien EA, Higgins DG |year=1997 |title=RAGA: RNA sequence alignment by genetic algorithm |journal=Nucleic Acids Res |volume=25 |issue=22 |pages=4570–80 |pmid=9358168 |pmc=147093}}</ref>
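The "sum of pairs" objective named above can be sketched as follows. This is a minimal illustration, assuming a simple scoring scheme (match +1, mismatch −1, residue-versus-gap −2, gap-versus-gap 0); these values and the function name <code>sum_of_pairs</code> are assumptions for the example, and real programs such as SAGA typically use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing pairwise scores over every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap facing a gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["ACGT", "ACGA"]))          # 2
print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))  # 0
```

A genetic algorithm would evaluate a function of this kind on every candidate alignment in its population and preferentially keep the higher-scoring candidates.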
 
In the technique of [[simulated annealing]], an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).<ref name="kim">{{cite journal |vauthors=Kim J, Pramanik S, Chung MJ |year=1994 |title=Multiple sequence alignment using simulated annealing |journal=Computer Applications in the Biosciences |volume=10 |issue=4 |pages=419–26 |pmid=7804875 |doi=10.1093/bioinformatics/10.4.419}}</ref>
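The refinement loop can be sketched in Python as below. This is not the MSASA implementation: the move set (swapping one gap with a neighbouring character), the scoring values, the single geometric cooling schedule (rather than alternating periods), and all function names are assumptions chosen to keep the example short.

```python
import math
import random
from itertools import combinations


def pair_score(a, b):
    """Illustrative scores: match +1, mismatch -1, gap -2, gap-gap 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_score(rows):
    """Sum-of-pairs score over all columns of the alignment."""
    return sum(pair_score(a, b)
               for col in zip(*rows) for a, b in combinations(col, 2))


def shift_gap(rows, rng):
    """Propose a neighbour: swap one gap with an adjacent character in one row."""
    rows = list(rows)
    i = rng.randrange(len(rows))
    row = list(rows[i])
    gaps = [k for k, ch in enumerate(row) if ch == "-"]
    if gaps:
        k = rng.choice(gaps)
        j = k + rng.choice((-1, 1))
        if 0 <= j < len(row):
            row[k], row[j] = row[j], row[k]
        rows[i] = "".join(row)
    return rows


def anneal(rows, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Refine an alignment; returns the best alignment seen and its score."""
    rng = random.Random(seed)
    current, current_score = list(rows), sp_score(rows)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = shift_gap(current, rng)
        delta = sp_score(candidate) - current_score
        # Metropolis rule: always accept improvements, sometimes accept losses
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


start = ["AC-GT", "ACG-T", "ACGT-"]
refined, score = anneal(start)
print(sp_score(start), score)
```

At high temperature the loop accepts many score-lowering moves and wanders widely; as the temperature falls it behaves more like hill climbing around the region it has reached.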
 
===Mathematical programming and exact solution algorithms===
[[Mathematical optimization|Mathematical programming]] and in particular [[mixed integer programming]] models are another approach to solve MSA problems. The advantage of such optimization models is that they can be used to find the optimal MSA solution more efficiently compared to the traditional DP approach. This is due in part to the applicability of decomposition techniques for mathematical programs, where the MSA model is decomposed into smaller parts and iteratively solved until the optimal solution is found. Example algorithms used to solve mixed integer programming models of MSA include [[branch and price]]<ref name="althaus2006">{{cite journal |doi=10.1007/s10107-005-0659-3 |vauthors=Althaus E, Caprara A, Lenhof HP, Reinert K |year=2006 |title=A branch-and-cut algorithm for multiple sequence alignment |journal=Mathematical Programming |volume=105 |issue=2–3 |pages=387–425 |s2cid=17715172}}</ref> and [[Benders decomposition]].<ref name="hosseininasab"/> Although exact approaches are computationally slow compared to heuristic algorithms for MSA, they are guaranteed to reach the optimal solution eventually, even for large-size problems.
 
===Simulated quantum computing===
Line 121 ⟶ 115:
The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. For example, an evaluation of several leading alignment programs using the [[List of sequence alignment software#Benchmarking|BAliBase benchmark]] found that at least 24% of all pairs of aligned amino acids were incorrectly aligned.<ref name="nuin2006">{{cite journal |vauthors=Nuin PA, Wang Z, Tillier ER |year=2006 |title=The accuracy of several multiple sequence alignment programs for proteins |journal=BMC Bioinformatics |doi=10.1186/1471-2105-7-471 |pmid=17062146 |volume=7 |pmc=1633746 |pages=471 |doi-access=free}}</ref> These errors can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. As the number of sequences and their divergence increase, many more errors will be made simply because of the heuristic nature of MSA algorithms. [[List of alignment visualization software|Multiple sequence alignment viewers]] enable alignments to be visually reviewed, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.<ref>{{cite web |title=Manual editing and adjustment of MSAs |publisher=European Molecular Biology Laboratory |year=2007 |url=http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html |access-date=March 7, 2010 |archive-url=https://web.archive.org/web/20150924000135/http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html |archive-date=September 24, 2015 |url-status=dead}}</ref>
 
However, as the number of sequences increases, and especially in genome-wide studies that involve many MSAs, it is impossible to manually curate all alignments. Furthermore, manual curation is subjective. And finally, even the best expert cannot confidently align the more ambiguous cases of highly diverged sequences. In such cases it is common practice to use automatic procedures to exclude unreliably aligned regions from the MSA. For the purpose of phylogeny reconstruction (see below) the Gblocks program is widely used to remove alignment blocks suspected of low quality, according to various cutoffs on the number of gapped sequences in alignment columns.<ref name="castresana2000">{{cite journal |vauthors=Castresana J |title=Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis |journal=Molecular Biology and Evolution |volume=17 |issue=4 |pages=540–52 |date=April 2000 |pmid=10742046 |doi=10.1093/oxfordjournals.molbev.a026334 |doi-access=free}}</ref> However, these criteria may excessively filter out regions with insertion/deletion events that may still be aligned reliably, and these regions might be desirable for other purposes such as detection of positive selection. A few alignment algorithms output site-specific scores that allow the selection of high-confidence regions. Such a service was first offered by the SOAP program,<ref name="loytynojaMilinkovitch2001">{{cite journal |vauthors=Löytynoja A, Milinkovitch MC |title=SOAP, cleaning multiple alignments from unstable blocks |journal=Bioinformatics |volume=17 |issue=6 |pages=573–4 |date=June 2001 |pmid=11395440 |doi=10.1093/bioinformatics/17.6.573 |doi-access=free}}</ref> which tests the robustness of each column to perturbation in the parameters of the popular alignment program CLUSTALW. The T-Coffee program<ref name=poirotOTooleNotredame2003>{{cite journal |vauthors=Poirot O, O'Toole E, Notredame C |title=Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments |journal=Nucleic Acids Res. |volume=31 |issue=13 |pages=3503–6 |date=July 2003 |pmid=12824354 |pmc=168929 |doi=10.1093/nar/gkg522}}</ref> uses a library of alignments in the construction of the final MSA, and its output MSA is colored according to confidence scores that reflect the agreement between different alignments in the library regarding each aligned residue. Its extension, Transitive Consistency Score (TCS), uses T-Coffee [[Library (computing)|libraries]] of pairwise alignments to evaluate any third-party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy.<ref name=TCS2014MBE>{{cite journal|last=Chang|first=JM|author2=Di Tommaso, P |author3=Notredame, C|title=TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction.|journal=Molecular Biology and Evolution|date=June 2014|volume=31|issue=6|pages=1625–37|doi=10.1093/molbev/msu117|pmid=24694831|doi-access=free}}</ref><ref name=TCS_2015_NAR>{{cite journal |vauthors=Chang JM, Di Tommaso P, Lefort V, Gascuel O, Notredame C |title=TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction |journal=Nucleic Acids Res. |volume=43 |issue=W1 |pages=W3–6 |date=July 2015 |pmid=25855806 |pmc=4489230 |doi=10.1093/nar/gkv310}}</ref> Another alignment program that can output an MSA with confidence scores is FSA,<ref name=bradley2009>{{cite journal |vauthors=Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L |title=Fast statistical alignment |journal=PLOS Comput. Biol. |volume=5 |issue=5 |pages=e1000392 |date=May 2009 |pmid=19478997 |pmc=2684580 |doi=10.1371/journal.pcbi.1000392 |bibcode=2009PLSCB...5E0392B |doi-access=free}}</ref> which uses a statistical model that allows calculation of the uncertainty in the alignment. The HoT (Heads-Or-Tails) score can be used as a measure of site-specific alignment uncertainty due to the existence of multiple co-optimal solutions.<ref name=landanGraur2008>{{cite book |vauthors=Landan G, Graur D |title=Biocomputing 2008 |chapter=Local reliability measures from sets of co-optimal multiple sequence alignments |journal=Pac Symp Biocomput |pages=15–24 |date=2008 |pmid=18229673 |doi=10.1142/9789812776136_0003 |isbn=978-981-277-608-2}}</ref> The GUIDANCE program<ref name="penn2010">{{cite journal |vauthors=Penn O, Privman E, Landan G, Graur D, Pupko T |title=An alignment confidence score capturing robustness to guide tree uncertainty |journal=Molecular Biology and Evolution |volume=27 |issue=8 |pages=1759–67 |date=August 2010 |pmid=20207713 |pmc=2908709 |doi=10.1093/molbev/msq066}}</ref> calculates a similar site-specific confidence measure based on the robustness of the alignment to uncertainty in the guide tree that is used in progressive alignment programs. An alternative, more statistically justified approach to assess alignment uncertainty is the use of probabilistic evolutionary models for joint estimation of phylogeny and alignment. A Bayesian approach allows calculation of posterior probabilities of the estimated phylogeny and alignment, which measure the confidence in these estimates. In this case, a posterior probability can be calculated for each site in the alignment. Such an approach was implemented in the program BAli-Phy.<ref name="redelingsSuchard2005">{{cite journal |vauthors=Redelings BD, Suchard MA |title=Joint Bayesian estimation of alignment and phylogeny |journal=Syst. Biol. |volume=54 |issue=3 |pages=401–18 |date=June 2005 |pmid=16012107 |doi=10.1080/10635150590947041 |doi-access=free}}</ref>
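The idea of excluding columns by a cutoff on the number of gapped sequences can be illustrated with a very small Python sketch. It is not the Gblocks algorithm, which applies several additional criteria; the function name <code>drop_gappy_columns</code> and the default threshold of 0.5 are assumptions for the example.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose fraction of gap characters is within the cutoff.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    keep = [
        i for i, column in enumerate(zip(*alignment))
        if column.count("-") / len(column) <= max_gap_fraction
    ]
    filtered = ["".join(row[i] for i in keep) for row in alignment]
    return filtered, keep


rows = ["AC-GT", "A--GT", "ACTGT", "A--GA"]
filtered, kept = drop_gappy_columns(rows)
print(filtered)  # ['ACGT', 'A-GT', 'ACGT', 'A-GA']
print(kept)      # [0, 1, 3, 4]
```

Only the third column, where three of four sequences carry a gap, is removed. Lowering the cutoff filters more aggressively, which is the situation in which reliably aligned insertion/deletion regions can be lost.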
 
There are free programs available for visualization of multiple sequence alignments, for example [[Jalview]] and [[UGENE]].
Line 131 ⟶ 125:
 
==See also==
 
*[[Alignment-free sequence analysis]]
*[[Cladistics]]