Content deleted Content added
m Open access bot: doi added to citation with #oabot. |
m Task 18 (cosmetic): eval 53 templates: hyphenate params (4×); |
||
Line 50:
=== Dynamic programming ===
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web|title=Help with matrices used in sequence comparison tools|url=http://www.ebi.ac.uk/help/matrix.html|url-status=dead|archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html|archive-date=March 11, 2010|
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP complete|NP-complete]] problem.<ref name="wang">{{cite journal|vauthors=Wang L, Jiang T|year=1994|title=On the complexity of multiple sequence alignment|journal=J Comput Biol|volume=1|issue=4|pages=337–348|citeseerx=10.1.1.408.894|doi=10.1089/cmb.1994.1.337|pmid=8790475}}</ref><ref name="just">{{cite journal|author=Just W|year=2001|title=Computational complexity of multiple sequence alignment with SP-score|journal=J Comput Biol|volume=8|issue=6|pages=615–23|citeseerx=10.1.1.31.6382|doi=10.1089/106652701753307511|pmid=11747615}}</ref><ref name="elias">{{cite journal|author=Elias, Isaac|year=2006|title=Settling the intractability of multiple alignment|journal=J Comput Biol|volume=13|issue=7|pages=1323–1339|citeseerx=10.1.1.6.256|doi=10.1089/cmb.2006.13.1323|pmid=17037961}}</ref> In 1989, based on Carrillo-Lipman Algorithm,<ref name="carrillo">{{cite journal|vauthors=Carrillo H, Lipman DJ|year=1988|title=The Multiple Sequence Alignment Problem in Biology|url=https://zenodo.org/record/1236134|journal=SIAM Journal on Applied Mathematics|volume=48|issue=5|pages=1073–1082|doi=10.1137/0148063}}</ref> Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.<ref name="altschul">{{cite journal|vauthors=Lipman DJ, Altschul SF, Kececioglu JD|year=1989|title=A tool for multiple sequence alignment|journal=Proc Natl Acad Sci U S A|volume=86|issue=12|pages=4412–4415|bibcode=1989PNAS...86.4412L|doi=10.1073/pnas.86.12.4412|pmc=287279|pmid=2734293}}</ref> In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called ''sum of pair'' score) and has been implemented in a software program for constructing multiple sequence alignments.<ref>{{cite web|title=Genetic analysis software|url=https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html|
=== Progressive alignment construction ===
Line 119:
==Alignment visualization and quality control==
The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. For example, an evaluation of several leading alignment programs using the [[List of sequence alignment software#Benchmarking|BAliBase benchmark]] found that at least 24% of all pairs of aligned amino acids were incorrectly aligned.<ref name="nuin2006">{{cite journal |vauthors=Nuin PA, Wang Z, Tillier ER |year=2006 |title=The accuracy of several multiple sequence alignment programs for proteins |journal=BMC Bioinformatics |doi=10.1186/1471-2105-7-471 |pmid=17062146 |volume=7 |pmc=1633746 |pages=471}}</ref> These errors can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. As the number of sequence and their divergence increases many more errors will be made simply because of the heuristic nature of MSA algorithms. [[List of alignment visualization software|Multiple sequence alignment viewers]] enable alignments to be visually reviewed, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.<ref>{{cite web | title=Manual editing and adjustment of MSAs | publisher=European Molecular Biology Laboratory | year=2007 | url=http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html |
However, as the number of sequences increases and especially in genome-wide studies that involve many MSAs it is impossible to manually curate all alignments. Furthermore, manual curation is subjective. And finally, even the best expert cannot confidently align the more ambiguous cases of highly diverged sequences. In such cases it is common practice to use automatic procedures to exclude unreliably aligned regions from the MSA. For the purpose of phylogeny reconstruction (see below) the Gblocks program is widely used to remove alignment blocks suspect of low quality, according to various cutoffs on the number of gapped sequences in alignment columns.<ref name="castresana2000">{{cite journal | vauthors = Castresana J | title = Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis | journal = Mol. Biol. Evol. | volume = 17 | issue = 4 | pages = 540–52 | date = April 2000 | pmid = 10742046 | doi = 10.1093/oxfordjournals.molbev.a026334 | doi-access = free }}</ref> However, these criteria may excessively filter out regions with insertion/deletion events that may still be aligned reliably, and these regions might be desirable for other purposes such as detection of positive selection. A few alignment algorithms output site-specific scores that allow the selection of high-confidence regions. Such a service was first offered by the SOAP program,<ref name="loytynojaMilinkovitch2001">{{cite journal | vauthors = Löytynoja A, Milinkovitch MC | title = SOAP, cleaning multiple alignments from unstable blocks | journal = Bioinformatics | volume = 17 | issue = 6 | pages = 573–4 | date = June 2001 | pmid = 11395440 | doi = 10.1093/bioinformatics/17.6.573 | doi-access = free }}</ref> which tests the robustness of each column to perturbation in the parameters of the popular alignment program CLUSTALW. The T-Coffee program<ref name=poirotOTooleNotredame2003>{{cite journal | vauthors = Poirot O, O'Toole E, Notredame C | title = Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments | journal = Nucleic Acids Res. | volume = 31 | issue = 13 | pages = 3503–6 | date = July 2003 | pmid = 12824354 | pmc = 168929 | doi = 10.1093/nar/gkg522 }}</ref> uses a library of alignments in the construction of the final MSA, and its output MSA is colored according to confidence scores that reflect the agreement between different alignments in the library regarding each aligned residue. Its extension, [http://tcoffee.crg.cat/tcs TCS] : ('''T'''ransitive '''C'''onsistency '''S'''core), uses T-Coffee libraries of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy.<ref name=TCS2014MBE>{{cite journal|last=Chang|first=JM|author2=Di Tommaso, P | author3 = Notredame, C|title=TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction.|journal=Molecular Biology and Evolution|date=Jun 2014|volume=31|issue=6|pages=1625–37|doi=10.1093/molbev/msu117|pmid=24694831|url=http://mbe.oxfordjournals.org/content/31/6/1625|doi-access=free}}</ref><ref name=TCS_2015_NAR>{{cite journal | vauthors = Chang JM, Di Tommaso P, Lefort V, Gascuel O, Notredame C | title = TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction | journal = Nucleic Acids Res. | volume = 43 | issue = W1 | pages = W3–6 | date = July 2015 | pmid = 25855806 | pmc = 4489230 | doi = 10.1093/nar/gkv310 }}</ref> Another alignment program that can output an MSA with confidence scores is FSA,<ref name=bradley2009>{{cite journal | vauthors = Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L | title = Fast statistical alignment | journal = PLOS Comput. Biol. | volume = 5 | issue = 5 | pages = e1000392 | date = May 2009 | pmid = 19478997 | pmc = 2684580 | doi = 10.1371/journal.pcbi.1000392 | bibcode = 2009PLSCB...5E0392B }}</ref> which uses a statistical model that allows calculation of the uncertainty in the alignment. The HoT (Heads-Or-Tails) score can be used as a measure of site-specific alignment uncertainty due to the existence of multiple co-optimal solutions.<ref name=landanGraur2008>{{cite book | vauthors = Landan G, Graur D | title = Biocomputing 2008 | chapter = Local reliability measures from sets of co-optimal multiple sequence alignments | journal = Pac Symp Biocomput | pages = 15–24 | date = 2008 | pmid = 18229673 | doi = 10.1142/9789812776136_0003 |isbn = 978-981-277-608-2 }}</ref> The GUIDANCE program<ref name="penn2010">{{cite journal | vauthors = Penn O, Privman E, Landan G, Graur D, Pupko T | title = An alignment confidence score capturing robustness to guide tree uncertainty | journal = Mol. Biol. Evol. | volume = 27 | issue = 8 | pages = 1759–67 | date = August 2010 | pmid = 20207713 | pmc = 2908709 | doi = 10.1093/molbev/msq066 }}</ref> calculates a similar site-specific confidence measure based on the robustness of the alignment to uncertainty in the guide tree that is used in progressive alignment programs. An alternative, more statistically justified approach to assess alignment uncertainty is the use of probabilistic evolutionary models for joint estimation of phylogeny and alignment. A Bayesian approach allows calculation of posterior probabilities of estimated phylogeny and alignment, which is a measure of the confidence in these estimates. In this case, a posterior probability can be calculated for each site in the alignment. Such an approach was implemented in the program BAli-Phy.<ref name="redelingsSuchard2005">{{cite journal | vauthors = Redelings BD, Suchard MA | title = Joint Bayesian estimation of alignment and phylogeny | journal = Syst. Biol. | volume = 54 | issue = 3 | pages = 401–18 | date = June 2005 | pmid = 16012107 | doi = 10.1080/10635150590947041 | doi-access = free }}</ref>
Line 126:
==Phylogenetic use==
Multiple sequence alignments can be used to create a [[phylogenetic tree]].<ref name="Budd_2009">{{Cite web | title=Multiple sequence alignment exercises and demonstrations | last=Budd | first=Aidan | publisher=European Molecular Biology Laboratory | date=10 February 2009 |
Multiple sequence alignments can also be used to identify functionally important sites, such as binding sites, active sites, or sites corresponding to other key functions, by locating conserved domains. When looking at multiple sequence alignments, it is useful to consider different aspects of the sequences when comparing sequences. These aspects include identity, similarity, and homology. Identity means that the sequences have identical residues at their respective positions. On the other hand, similarity has to do with the sequences being compared having similar residues quantitatively. For example, in terms of nucleotide sequences, pyrimidines are considered similar to each other, as are purines. Similarity ultimately leads to homology, in that the more similar sequences are, the closer they are to being homologous. This similarity in sequences can then go on to help find common ancestry.<ref name="Budd_2009"/>
|