ShelfSkewed (talk | contribs) →See also: removed entry linking to dab page that doesn't seem useful |
Citation bot (talk | contribs) Add: doi-access. | Use this bot. Report bugs. | #UCB_CommandLine |
Line 70:
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique.<ref name="hirosawa">{{cite journal |vauthors=Hirosawa M, Totoki Y, Hoshida M, Ishikawa M | year = 1995 | title = Comprehensive study on iterative algorithms of multiple sequence alignment | journal = Comput Appl Biosci | volume = 11 | issue = 1| pages = 13–18 | pmid = 7796270 | doi=10.1093/bioinformatics/11.1.13}}</ref> The software package [http://www.genome.jp/tools/prrn/ PRRN/PRRP] uses a [[hill-climbing algorithm]] to optimize its MSA score<ref name="gotoh">{{cite journal | doi = 10.1006/jmbi.1996.0679 | author = Gotoh O | year = 1996 | title = Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments | journal = J Mol Biol | volume = 264 | issue = 4| pages = 823–38 | pmid = 8980688 }}</ref> and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA.<ref name="mount">Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.</ref> PRRP performs best when refining an alignment previously constructed by a faster method.<ref name="mount"/>
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty.<ref name="brudno">{{cite journal | vauthors = Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B | title = Fast and sensitive multiple alignment of large genomic sequences | journal = BMC Bioinformatics | volume = 4 | pages = 66 | date = December 2003 | pmid = 14693042 | pmc = 521198 | doi = 10.1186/1471-2105-4-66 | doi-access = free }}</ref> The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or "seeds" for a slower global-alignment procedure is implemented in the [http://dialign.gobics.de/chaos-dialign-submission CHAOS/DIALIGN] suite.<ref name="brudno"/>
A third popular iteration-based method called [[MUSCLE (alignment software)|MUSCLE]] (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences.<ref name="edgar">{{cite journal | doi = 10.1093/nar/gkh340 | author = Edgar RC | year = 2004 | title = MUSCLE: multiple sequence alignment with high accuracy and high throughput | journal = Nucleic Acids Research | volume = 32 | issue = 5| pages = 1792–97 | pmid=15034147 | pmc=390337}}</ref> The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled).
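The hill-climbing idea behind iterative refinement can be illustrated with a minimal sketch. It is not the PRRN/PRRP or MUSCLE procedure: the sum-of-pairs score, the scoring values (match 1, mismatch -1, gap -2) and the single move type (swapping one gap with an adjacent residue) are assumptions chosen only to keep the example short.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values only)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the pair scores over every column and every pair of sequences."""
    return sum(
        pair_score(a, b)
        for col in zip(*rows)
        for a, b in combinations(col, 2)
    )


def neighbours(rows):
    """Yield alignments obtained by swapping one gap with an adjacent residue."""
    for i, row in enumerate(rows):
        for j in range(len(row) - 1):
            x, y = row[j], row[j + 1]
            if (x == "-") != (y == "-"):  # exactly one of the two is a gap
                new_row = row[:j] + y + x + row[j + 2:]
                yield rows[:i] + [new_row] + rows[i + 1:]


def hill_climb(rows):
    """Accept the first improving move until no single move raises the score."""
    rows = list(rows)
    best = sum_of_pairs(rows)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(rows):
            score = sum_of_pairs(candidate)
            if score > best:
                rows, best, improved = candidate, score, True
                break
    return rows, best


refined, score = hill_climb(["AC-GT", "ACG-T"])
print(refined, score)
```

Because every accepted move strictly increases the score, the loop terminates, but only at a local optimum; this is why such refinement tends to work best when started from a reasonable initial alignment.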
===Consensus methods===
Consensus methods attempt to find the optimal multiple sequence alignment given multiple different alignments of the same set of sequences. There are two commonly used consensus methods, [http://www.tcoffee.org/Projects/mcoffee/ M-COFFEE] and [http://www.stevekellylab.com/software/mergealign MergeAlign].<ref name="mergealign">{{cite journal | doi = 10.1186/1471-2105-13-117 |vauthors=Collingridge PW, Kelly S | year = 2012 | title = MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments | journal = BMC Bioinformatics | volume = 13 | issue = 117 |pages=117 | pmid=22646090 | pmc=3413523 |doi-access=free }}</ref> M-COFFEE uses multiple sequence alignments generated by seven different methods to generate consensus alignments. MergeAlign is capable of generating consensus alignments from any number of input alignments generated using different models of sequence evolution or different methods of multiple sequence alignment. The default option for MergeAlign is to infer a consensus alignment using alignments generated using 91 different models of protein sequence evolution.
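The notion of agreement among alternative alignments can be made concrete with a small sketch. It is not the algorithm used by M-COFFEE or MergeAlign; the function names and the support measure (the fraction of alternative alignments that contain exactly the same column) are assumptions made for illustration.

```python
def column_keys(rows):
    """Describe each column by the residue index it holds in every sequence.

    Gaps are recorded as None, so two columns are equal only if they align
    the same residues of the same sequences.
    """
    counters = [0] * len(rows)
    keys = []
    for col in zip(*rows):
        key = []
        for i, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[i])
                counters[i] += 1
        keys.append(tuple(key))
    return keys


def column_support(reference, alternatives):
    """Return, per reference column, the fraction of alternatives containing it."""
    alt_sets = [set(column_keys(alt)) for alt in alternatives]
    return [
        sum(key in s for s in alt_sets) / len(alt_sets)
        for key in column_keys(reference)
    ]


reference = ["AC-GT", "ACG-T"]
alternatives = [["AC-GT", "ACG-T"], ["ACG-T", "ACG-T"]]
print(column_support(reference, alternatives))
```

Columns reproduced by every input alignment receive a support of 1.0, while columns found in only some of them receive lower values and are natural candidates for exclusion from a consensus.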
===Hidden Markov models===
Line 91:
Most multiple sequence alignment methods try to minimize the number of [[Indel|insertions/deletions]] (gaps) and, as a consequence, produce compact alignments. This causes several problems if the sequences to be aligned contain non-[[Homology (biology)#Sequence homology|homologous]] regions or if gaps are informative in a [[phylogeny]] analysis. These problems are common in newly produced sequences that are poorly annotated and may contain [[Frameshift|frame-shifts]], wrong [[Protein domain|domains]] or non-homologous [[RNA splicing|spliced]] [[exons]]. The first phylogeny-aware method addressing these problems was developed in 2005 by Löytynoja and Goldman.<ref name="Loytynoja-2005">{{Cite journal | last1 = Loytynoja | first1 = A. | title = An algorithm for progressive multiple alignment of sequences with insertions | doi = 10.1073/pnas.0409137102 | journal = Proceedings of the National Academy of Sciences | volume = 102 | issue = 30 | pages = 10557–10562 | year = 2005 | pmid = 16000407| pmc = 1180752| bibcode = 2005PNAS..10210557L | doi-access = free }}</ref> The same authors released a software package called ''PRANK'' in 2008.<ref name="Loytynoja-2008">{{cite journal | vauthors = Löytynoja A, Goldman N | title = Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis | journal = Science | volume = 320 | issue = 5883 | pages = 1632–5 | date = June 2008 | pmid = 18566285 | doi = 10.1126/science.1158395 | bibcode = 2008Sci...320.1632L | s2cid = 5211928 }}</ref> PRANK improves alignments when insertions are present. Nevertheless, it runs slowly compared with progressive and iterative methods, which have been in development for several years.
In 2012, two new phylogeny-aware tools appeared. One, ''PAGAN'', was developed by the same team as PRANK.<ref name="Loytynoja-2012">{{cite journal | vauthors = Löytynoja A, Vilella AJ, Goldman N | title = Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm | journal = Bioinformatics | volume = 28 | issue = 13 | pages = 1684–91 | date = July 2012 | pmid = 22531217 | pmc = 3381962 | doi = 10.1093/bioinformatics/bts198 }}</ref> The other, ''ProGraphMSA'', was developed by Szalkowski.<ref name="Szalkowski-2012">{{cite journal | vauthors = Szalkowski AM | title = Fast and robust multiple sequence alignment with phylogeny-aware gap placement | journal = BMC Bioinformatics | volume = 13 | pages = 129 | date = June 2012 | pmid = 22694311 | pmc = 3495709 | doi = 10.1186/1471-2105-13-129 | doi-access = free }}</ref> The two packages were developed independently but share common features, notably the use of [[Directed acyclic graph#Partial orders and topological ordering|graph algorithms]] to improve the recognition of non-homologous regions, and code improvements that make them faster than PRANK.
===Motif finding===
Line 119:
==Alignment visualization and quality control==
The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. For example, an evaluation of several leading alignment programs using the [[List of sequence alignment software#Benchmarking|BAliBase benchmark]] found that at least 24% of all pairs of aligned amino acids were incorrectly aligned.<ref name="nuin2006">{{cite journal |vauthors=Nuin PA, Wang Z, Tillier ER |year=2006 |title=The accuracy of several multiple sequence alignment programs for proteins |journal=BMC Bioinformatics |doi=10.1186/1471-2105-7-471 |pmid=17062146 |volume=7 |pmc=1633746 |pages=471 |doi-access=free }}</ref> These errors can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. As the number of sequences and their divergence increase, many more errors will be made simply because of the heuristic nature of MSA algorithms. [[List of alignment visualization software|Multiple sequence alignment viewers]] enable alignments to be visually reviewed, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.<ref>{{cite web | title=Manual editing and adjustment of MSAs | publisher=European Molecular Biology Laboratory | year=2007 | url=http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html | access-date=March 7, 2010 | archive-url=https://web.archive.org/web/20150924000135/http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html | archive-date=September 24, 2015 | url-status=dead }}</ref>
However, as the number of sequences increases, and especially in genome-wide studies that involve many MSAs, it is impossible to manually curate all alignments. Furthermore, manual curation is subjective. Finally, even the best expert cannot confidently align the more ambiguous cases of highly diverged sequences. In such cases it is common practice to use automatic procedures to exclude unreliably aligned regions from the MSA. For the purpose of phylogeny reconstruction (see below) the Gblocks program is widely used to remove alignment blocks suspected of low quality, according to various cutoffs on the number of gapped sequences in alignment columns.<ref name="castresana2000">{{cite journal | vauthors = Castresana J | title = Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis | journal = Mol. Biol. Evol. | volume = 17 | issue = 4 | pages = 540–52 | date = April 2000 | pmid = 10742046 | doi = 10.1093/oxfordjournals.molbev.a026334 | doi-access = free }}</ref> However, these criteria may excessively filter out regions with insertion/deletion events that may still be aligned reliably, and these regions might be desirable for other purposes such as detection of positive selection. A few alignment algorithms output site-specific scores that allow the selection of high-confidence regions. Such a service was first offered by the SOAP program,<ref name="loytynojaMilinkovitch2001">{{cite journal | vauthors = Löytynoja A, Milinkovitch MC | title = SOAP, cleaning multiple alignments from unstable blocks | journal = Bioinformatics | volume = 17 | issue = 6 | pages = 573–4 | date = June 2001 | pmid = 11395440 | doi = 10.1093/bioinformatics/17.6.573 | doi-access = free }}</ref> which tests the robustness of each column to perturbation in the parameters of the popular alignment program CLUSTALW.
The T-Coffee program<ref name=poirotOTooleNotredame2003>{{cite journal | vauthors = Poirot O, O'Toole E, Notredame C | title = Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments | journal = Nucleic Acids Res. | volume = 31 | issue = 13 | pages = 3503–6 | date = July 2003 | pmid = 12824354 | pmc = 168929 | doi = 10.1093/nar/gkg522 }}</ref> uses a library of alignments in the construction of the final MSA, and its output MSA is colored according to confidence scores that reflect the agreement between different alignments in the library regarding each aligned residue. Its extension, [http://tcoffee.crg.cat/tcs TCS] ('''T'''ransitive '''C'''onsistency '''S'''core), uses T-Coffee libraries of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy.<ref name=TCS2014MBE>{{cite journal|last=Chang|first=JM|author2=Di Tommaso, P | author3 = Notredame, C|title=TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction.|journal=Molecular Biology and Evolution|date=Jun 2014|volume=31|issue=6|pages=1625–37|doi=10.1093/molbev/msu117|pmid=24694831|doi-access=free}}</ref><ref name=TCS_2015_NAR>{{cite journal | vauthors = Chang JM, Di Tommaso P, Lefort V, Gascuel O, Notredame C | title = TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction | journal = Nucleic Acids Res. | volume = 43 | issue = W1 | pages = W3–6 | date = July 2015 | pmid = 25855806 | pmc = 4489230 | doi = 10.1093/nar/gkv310 }}</ref> Another alignment program that can output an MSA with confidence scores is FSA,<ref name=bradley2009>{{cite journal | vauthors = Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L | title = Fast statistical alignment | journal = PLOS Comput. Biol. 
| volume = 5 | issue = 5 | pages = e1000392 | date = May 2009 | pmid = 19478997 | pmc = 2684580 | doi = 10.1371/journal.pcbi.1000392 | bibcode = 2009PLSCB...5E0392B | doi-access = free }}</ref> which uses a statistical model that allows calculation of the uncertainty in the alignment. The HoT (Heads-Or-Tails) score can be used as a measure of site-specific alignment uncertainty due to the existence of multiple co-optimal solutions.<ref name=landanGraur2008>{{cite book | vauthors = Landan G, Graur D | title = Biocomputing 2008 | chapter = Local reliability measures from sets of co-optimal multiple sequence alignments | journal = Pac Symp Biocomput | pages = 15–24 | date = 2008 | pmid = 18229673 | doi = 10.1142/9789812776136_0003 |isbn = 978-981-277-608-2 }}</ref> The GUIDANCE program<ref name="penn2010">{{cite journal | vauthors = Penn O, Privman E, Landan G, Graur D, Pupko T | title = An alignment confidence score capturing robustness to guide tree uncertainty | journal = Mol. Biol. Evol. | volume = 27 | issue = 8 | pages = 1759–67 | date = August 2010 | pmid = 20207713 | pmc = 2908709 | doi = 10.1093/molbev/msq066 }}</ref> calculates a similar site-specific confidence measure based on the robustness of the alignment to uncertainty in the guide tree that is used in progressive alignment programs. An alternative, more statistically justified approach to assess alignment uncertainty is the use of probabilistic evolutionary models for joint estimation of phylogeny and alignment. A Bayesian approach allows calculation of posterior probabilities of estimated phylogeny and alignment, which is a measure of the confidence in these estimates. In this case, a posterior probability can be calculated for each site in the alignment. Such an approach was implemented in the program BAli-Phy.<ref name="redelingsSuchard2005">{{cite journal | vauthors = Redelings BD, Suchard MA | title = Joint Bayesian estimation of alignment and phylogeny | journal = Syst. Biol. 
| volume = 54 | issue = 3 | pages = 401–18 | date = June 2005 | pmid = 16012107 | doi = 10.1080/10635150590947041 | doi-access = free }}</ref>
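Filtering columns by their proportion of gaps, the simplest of the automatic procedures discussed above, can be sketched as follows. This is a deliberate simplification and not the Gblocks algorithm, which applies several further criteria; the function names and the 0.5 threshold are assumptions for illustration.

```python
def gap_fraction(column):
    """Fraction of sequences that have a gap in this column."""
    return column.count("-") / len(column)


def filter_columns(rows, max_gap_fraction=0.5):
    """Keep only the columns whose gap fraction does not exceed the threshold."""
    kept = [col for col in zip(*rows) if gap_fraction(col) <= max_gap_fraction]
    if not kept:
        # Every column was rejected: return one empty string per sequence.
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


alignment = ["A-CG-", "A-CGT", "AT-G-"]
print(filter_columns(alignment))
```

A lower threshold removes more columns and therefore more indel-containing regions, which mirrors the trade-off noted above: aggressive filtering may discard regions that are in fact reliably aligned.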
There are free programs available for visualization of multiple sequence alignments, for example [[Jalview]] and [[UGENE]].
Line 149:
*{{Cite journal | first1 = J. D. | last1 = Thompson | first2 = F. | last2 = Plewniak | first3 = O. | last3 = Poch | year = 1999 | title = A comprehensive comparison of multiple sequence alignment programs | journal = Nucleic Acids Research | volume = 27 | issue = 13 | pages= 2682–2690 | doi = 10.1093/nar/27.13.2682 | pmid = 10373585 | pmc = 148477 }}
*{{Cite journal | first1 = I.M. | last1 = Wallace | first2 = G. | last2 = Blackshields | first3 = D.G. | last3 = Higgins | year = 2005 | title = Multiple sequence alignments | journal = Curr Opin Struct Biol | volume = 15 | issue = 3 | pages = 261–266 | doi = 10.1016/j.sbi.2005.04.002 | pmid = 15963889}}
*{{Cite journal | first = C | last = Notredame | year = 2007 | title = Recent Evolutions of Multiple Sequence Alignment Algorithms| journal = PLOS Computational Biology | volume = 3 | issue = 8 | page = e123 | doi = 10.1371/journal.pcbi.0030123 | pmid = 17784778 | pmc = 1963500| bibcode = 2007PLSCB...3..123N | doi-access = free }}
==External links==