Sequence alignment: Difference between revisions

Content deleted Content added
m General fixes and manual cleanup, replaced: Open Source software → open-source software
Citation bot (talk | contribs)
m Alter: title. Add: doi. Removed URL that duplicated unique identifier. Removed parameters. | You can use this bot yourself. Report bugs here. | User-activated.
Line 77:
 
===Dynamic programming===
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal | journal=Proc Natl Acad Sci USA | volume=86 | pages=4412–5 | year=1989 |author1=Lipman DJ |author2=Altschul SF |author3=Kececioglu JD | title=A tool for multiple sequence alignment | url=http://www.pnas.org/cgi/pmidlookup?view=long&pmid=2734293 | pmid=2734293 | doi=10.1073/pnas.86.12.4412 | issue=12 | pmc=287279 | bibcode=1989PNAS...86.4412L }}</ref>
 
===Progressive methods===
Line 97:
==Structural alignment==
{{Main|Structural alignment}}
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the [[secondary structure|secondary]] and [[tertiary structure]] of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through [[X-ray crystallography]] or [[NMR spectroscopy]]). Because both protein and RNA structure is more evolutionarily conserved than sequence,<ref name=chothia>{{cite journal | journal=EMBO J | volume=5 | issue=4 | pages=823–6 |date=April 1986 |author1=Chothia C |author2=Lesk AM. | title=The relation between the divergence of sequence and structure in proteins | pmid=3709526 |pmc=1166865 | doi=10.1002/j.1460-2075.1986.tb04288.x }}</ref> structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
 
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based [[protein structure prediction]]<ref name=skolnick>{{cite journal | journal=Proc Natl Acad Sci USA | volume=102 | pages=1029–34 | year=2005 |author1=Zhang Y |author2=Skolnick J. | title=The protein structure prediction problem could be solved using the current PDB library | url=http://www.pnas.org/cgi/pmidlookup?view=long&pmid=15653774 | pmid=15653774 | doi = 10.1073/pnas.0407152101 | issue=4 | pmc=545829 | bibcode=2005PNAS..102.1029Z }}</ref> because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.<ref name=skolnick/>
 
===DALI===
The DALI method, or [[distance matrix]] alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.<ref name=holm>{{cite journal | journal=Science | volume=273 | pages=595–603 | year=1996 |author1=Holm L |author2=Sander C | title=Mapping the protein universe | url=http://www.sciencemag.org/cgi/pmidlookup?view=long&pmid=8662544 | pmid=8662544 | doi = 10.1126/science.273.5275.595 | issue=5275 | bibcode=1996Sci...273..595H }}</ref> It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the [[Protein Data Bank]] (PDB). It has been used to construct the [[Families of structurally similar proteins|FSSP]] structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at [https://web.archive.org/web/20090301064750/http://ekhidna.biocenter.helsinki.fi/dali_server/start DALI] and the FSSP is located at [https://web.archive.org/web/20051125045348/http://ekhidna.biocenter.helsinki.fi/dali/start The Dali Database].
 
===SSAP===
Line 122:
 
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite book|author1=Altschul SF |author2=Gish W | year=1996| title=Local Alignment Statistics| journal= Meth.Enz. | volume=266 | pages = 460–480|doi=10.1016/S0076-6879(96)66029-7|series=Methods in Enzymology|isbn=9780121821678}}</ref><ref name=hartmann>{{cite journal| author=Hartmann AK| year=2002| title=Sampling rare events: statistics of local sequence alignments|
journal= Phys. Rev. E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102| pmid=12059642| issue=5|arxiv=cond-mat/0108201|bibcode=2002PhRvE..65e6102H}}</ref><ref name=newberg>{{cite journal| author=Newberg LA | year=2008 | title=Significance of gapped sequence alignments | journal= J Comput Biolo | volume=15| pages=1187–1194 | pmid = 18973434 | doi=10.1089/cmb.2008.0125| nopp=true| issue=9| pmc=2737730}}</ref><ref name=eddy>{{cite journal| author=Eddy SR| year=2008 | title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation | journal= PLoS Comput Biol | volume=4| editor1-first=Burkhard| pages=e1000069 | pmid = 18516236| editor1-last=Rost | doi=10.1371/journal.pcbi.1000069| issue=5| pmc=2396288| last2=Rost| first2=Burkhard| bibcode=2008PLSCB...4E0069E}}</ref><ref name=bastien>{{cite journal|author1=Bastien O |author2=Aude JC |author3=Roy S |author4=Marechal E | year=2004 | title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics | journal= Bioinformatics | volume=20| issue=4| pages=534–537| pmid = 14990449| doi = 10.1093/bioinformatics/btg440 | url=http://bioinformatics.oxfordjournals.org/content/20/4/534.long}}</ref><ref name=agrawal11>{{cite journal|author1=Agrawal A |author2=Huang X | year=2011| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics| volume=8| pages=194–205|doi=10.1109/TCBB.2009.69|pmid=21071807 |url=http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793|archive-url=https://archive.is/20130415004914/http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793|dead-url=yes|archive-date=2013-04-15| issue=1}}</ref><ref name=agrawal08>{{cite journal| author1=Agrawal A| author2=Brendel VP| author3=Huang X| year=2008| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment| journal=International Journal of Computational Biology and Drug Design| volume=1| pages=347–367| doi=10.1504/IJCBDD.2008.022207| url=http://inderscience.metapress.com/content/1558538106522500/| issue=4| deadurl=yes| archiveurl=https://archive.is/20130128163812/http://inderscience.metapress.com/content/1558538106522500/| archivedate=28 January 2013| df=dmy-all}}</ref>
 
===Assessment of credibility===
Line 136:
 
==Non-biological uses==
The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in social sciences, where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal|author1=Abbott A. |author2=Tsay A. | year=2000 | title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect | journal=Sociological Methods and Research | volume=29|issue=1 | pages=3–33 | doi=10.1177/0049124100029001001}}</ref> Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.<ref name=Barzilay>{{cite journal|author1=Barzilay R |author2=Lee L. |year=2002 | title= Bootstrapping Lexical Choice via Multiple-Sequence Alignment | journal=Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) | pages=164–171 | url=http://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf|format=PDF| volume=10| doi=10.3115/1118693.1118715|arxiv=cs/0205065}}</ref> In the field of historical and comparative [[linguistics]], sequence alignment has been used to partially automate the [[comparative method (linguistics)|comparative method]] by which linguists traditionally reconstruct languages.<ref>{{cite journal |author=Kondrak, Grzegorz |title=Algorithms for Language Reconstruction |publisher=University of Toronto, Ontario |year=2002 |url=http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |format=PDF |accessdate = 2007-01-21 }}</ref> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<ref name=prinzie>{{cite journal|author1=Prinzie A. |author2=D. Van den Poel |year=2006 | url=http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm | title=Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM | journal=Decision Support Systems | volume=42 | issue=2| pages= 508–526 | doi=10.1016/j.dss.2005.02.004}} See also Prinzie and Van den Poel's paper {{cite journal | url=http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm | title=Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models | year=2007 | journal=Decision Support Systems | volume=44 | issue=1 | pages= 28–45 | doi=10.1016/j.dss.2007.02.008 | author=Prinzie, A | last2=Vandenpoel | first2=D}}</ref>
 
==Software==
Line 142:
A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI|last=EMBL-EBI|date=|website=www.EBI.ac.uk|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web|url=http://blast.ncbi.nlm.nih.gov/Blast.cgi|title=BLAST: Basic Local Alignment Search Tool|author=|date=|website=blast.ncbi.nlm.NIH.gov|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml|title=UVA FASTA Server|author=|date=|website=fasta.bioch.Virginia.edu|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry.
 
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | url=http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10068696 | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. | title=A comprehensive comparison of multiple sequence alignment programs | url=http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10373585 | pmid=10373585 | doi = 10.1093/nar/27.13.2682 | issue=13 | pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web|url=http://3d-alignment.eu/|title=Multiple sequence alignment: Strap|author=|date=|website=3d-alignment.eu|access-date=12 June 2017}}</ref>
 
==See also==