Sequence alignment: Difference between revisions

Content deleted Content added
Adding local short description: "Process in bioinformatics that identifies equivalent sites within molecular sequences", overriding Wikidata description "process in bioinformatics that aligns (identifies equivalent sites within) molecular sequences" (Shortdesc helper)
Citation bot (talk | contribs)
Add: s2cid. Removed URL that duplicated unique identifier. | You can use this bot yourself. Report bugs here. | Suggested by SemperIocundus | via #UCB_webform
Line 28:
| doi = 10.1186/1748-7188-6-25
| pmc = 3223492
| s2cid = 2658261
| url = https://www.semanticscholar.org/paper/2eaf69f1a0524b3fa8be7e4e481b9fedb5834d74
}}</ref> A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like [[dynamic programming]]. These also include efficient, [[heuristic algorithm]]s or [[probability|probabilistic]] methods designed for large-scale database search, that do not guarantee to find best matches.
 
Line 86:
Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to [[FASTA]]. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
 
Many variations of the [[Clustal]] progressive implementation<ref name=higgins>{{cite journal | journal=Gene | volume=73 | issue=1 | pages=237–44 | year=1988 | author=[[Desmond G. Higgins|Higgins DG]], Sharp PM | title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer | pmid=3243435 | doi = 10.1016/0378-1119(88)90330-7 }}</ref><ref name=thompson>{{cite journal | journal=Nucleic Acids Res | volume=22 | pages=4673–80 | year=1994 | author1=Thompson JD| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Gibson TJ. | title=CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | pmid=7984417 |pmc=308517 |url=http://nar.oxfordjournals.org/content/22/22/4673 |doi=10.1093/nar/22.22.4673 | issue=22 }}</ref><ref name=chenna>{{cite journal | journal=Nucleic Acids Res | volume=31 | pages=3497–500 | year=2003 |author1=Chenna R |author2=Sugawara H |author3=Koike T |author4=Lopez R |author5=Gibson TJ |author6=Higgins DG |author7=Thompson JD. | title=Multiple sequence alignment with the Clustal series of programs | url=http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=12824352 | pmid=12824352 | doi = 10.1093/nar/gkg500 | issue=13 | pmc=168907 }}</ref> are used for multiple sequence alignment, phylogenetic tree construction, and as input for [[protein structure prediction]]. A slower but more accurate variant of the progressive method is known as [[T-Coffee]].<ref name=notredame>{{cite journal | journal=J Mol Biol | volume=302 | issue=1 | pages=205–17 | year=2000 | author1=Notredame C| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Heringa J. | title=T-Coffee: A novel method for fast and accurate multiple sequence alignment | pmid=10964570 | doi = 10.1006/jmbi.2000.4042 | s2cid=10189971 | url=https://semanticscholar.org/paper/701a93542d50c480699d7d0660f33875ae639dfd }}</ref>
 
===Iterative methods===
Line 108:
 
===DALI===
The DALI method, or [[distance matrix]] alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.<ref name=holm>{{cite journal | journal=Science | volume=273 | pages=595–603 | year=1996 |author1=Holm L |author2=Sander C | title=Mapping the protein universe | pmid=8662544 | doi = 10.1126/science.273.5275.595 | issue=5275 | bibcode=1996Sci...273..595H | s2cid=7509134 | url=https://semanticscholar.org/paper/26eebe60c05a6d2eefb687dc63e83e753546102a }}</ref> It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the [[Protein Data Bank]] (PDB). It has been used to construct the [[Families of structurally similar proteins|FSSP]] structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at [https://web.archive.org/web/20090301064750/http://ekhidna.biocenter.helsinki.fi/dali_server/start DALI] and the FSSP is located at [https://web.archive.org/web/20051125045348/http://ekhidna.biocenter.helsinki.fi/dali/start The Dali Database].
 
===SSAP===
Line 128:
 
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite book|author1=Altschul SF |author2=Gish W | year=1996| title=Local Alignment Statistics| journal= Meth.Enz. | volume=266 | pages = 460–480|doi=10.1016/S0076-6879(96)66029-7|pmid=8743700 |series=Methods in Enzymology|isbn=9780121821678}}</ref><ref name=hartmann>{{cite journal| author=Hartmann AK| year=2002| title=Sampling rare events: statistics of local sequence alignments|
journal= Phys. Rev. E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102| pmid=12059642| issue=5|arxiv=cond-mat/0108201|bibcode=2002PhRvE..65e6102H| s2cid=193085| url=https://www.semanticscholar.org/paper/bedd73ed63f6f8ea1985360f0d725630fe0f3fc3}}</ref><ref name=newberg>{{cite journal| author=Newberg LA | year=2008 | title=Significance of gapped sequence alignments | journal= J Comput Biol| volume=15| pages=1187–1194 | pmid = 18973434 | doi=10.1089/cmb.2008.0125| nopp=true| issue=9| pmc=2737730}}</ref><ref name=eddy>{{cite journal| author=Eddy SR| year=2008 | title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation | journal= PLOS Comput Biol | volume=4| editor1-first=Burkhard| pages=e1000069 | pmid = 18516236| editor1-last=Rost | doi=10.1371/journal.pcbi.1000069| issue=5| pmc=2396288| last2=Rost| first2=Burkhard| bibcode=2008PLSCB...4E0069E| urls2cid=https://www.semanticscholar.org/paper/0b66a33b74518b0f0e46d5157a3b571035aab40e15640896 }}</ref><ref name=bastien>{{cite journal|author1=Bastien O |author2=Aude JC |author3=Roy S |author4=Marechal E | year=2004 | title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics | journal= Bioinformatics | volume=20| issue=4| pages=534–537| pmid = 14990449| doi = 10.1093/bioinformatics/btg440 | url=http://bioinformatics.oxfordjournals.org/content/20/4/534.long| doi-access=free }}</ref><ref name=agrawal11>{{cite journal|author1=Agrawal A |author2=Huang X | year=2011| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics| volume=8| pages=194–205|doi=10.1109/TCBB.2009.69|pmid=21071807 | issue=1|s2cid=6559731 |url=https://www.semanticscholar.org/paper/765f9333d5af7274c0a44b39407f78c1dcdfab0f }}</ref><ref name=agrawal08>{{cite journal| author1=Agrawal A| author2=Brendel VP| author3=Huang X| year=2008| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment| journal=International Journal of Computational Biology and Drug Design| volume=1| pages=347–367| doi=10.1504/IJCBDD.2008.022207| pmid=20063463| url=http://inderscience.metapress.com/content/1558538106522500/| issue=4| url-status=dead| archiveurl=https://archive.is/20130128163812/http://inderscience.metapress.com/content/1558538106522500/| archivedate=28 January 2013| df=dmy-all}}</ref>
 
===Assessment of credibility===
Line 139:
 
==Other biological uses==
Sequenced RNA, such as [[expressed sequence tags]] and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about [[alternative splicing]]<ref>{{cite book |author1=Kim N |author2=Lee C |title=Bioinformatics detection of alternative splicing |journal=Methods Mol. Biol. |volume=452 |issue= |pages=179–97 |year=2008 |pmid=18566765 |doi=10.1007/978-1-60327-159-2_9 |series=Methods in Molecular Biology™ |isbn=978-1-58829-707-5}}</ref> and [[RNA editing]].<ref>{{cite journal |vauthors=Li JB, Levanon EY, Yoon JK, etal |title=Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing |journal=Science |volume=324 |issue=5931 |pages=1210–3 |date=May 2009 |pmid=19478186 |doi=10.1126/science.1170995|bibcode=2009Sci...324.1210L |s2cid=31148824 |url=https://semanticscholar.org/paper/7e6aecb6226022e7471d6262f7ea5ef90b7a53f5 }}</ref> Sequence alignment is also a part of [[genome assembly]], where sequences are aligned to find overlap so that ''[[contig]]s'' (long stretches of sequence) can be formed.<ref>{{cite journal |vauthors=Blazewicz J, Bryja M, Figlerowicz M, etal |title=Whole genome assembly from 454 sequencing output via modified DNA graph concept |journal=Comput Biol Chem |volume=33 |issue=3 |pages=224–30 |date=June 2009 |pmid=19477687 |doi=10.1016/j.compbiolchem.2009.04.005}}</ref> Another use is [[single nucleotide polymorphism|SNP]] analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.<ref>{{cite journal |author1=Duran C |author2=Appleby N |author3=Vardy M |author4=Imelfort M |author5=Edwards D |author6=Batley J |title=Single nucleotide polymorphism discovery in barley using autoSNPdb |journal=Plant Biotechnol. J. |volume=7 |issue=4 |pages=326–33 |date=May 2009 |pmid=19386041 |doi=10.1111/j.1467-7652.2009.00407.x }}</ref>
 
==Non-biological uses==
The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in social sciences, where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal|author1=Abbott A. |author2=Tsay A. | year=2000 | title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect | journal=Sociological Methods and Research | volume=29|issue=1 | pages=3–33 | doi=10.1177/0049124100029001001|s2cid=121097811 }}</ref> Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.<ref name=Barzilay>{{cite journal|author1=Barzilay R |author2=Lee L. |year=2002 | title= Bootstrapping Lexical Choice via Multiple-Sequence Alignment | journal=Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) | pages=164–171 | url=http://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf| volume=10| doi=10.3115/1118693.1118715|arxiv=cs/0205065|bibcode=2002cs........5065B |s2cid=7521453 }}</ref> In the field of historical and comparative [[linguistics]], sequence alignment has been used to partially automate the [[comparative method (linguistics)|comparative method]] by which linguists traditionally reconstruct languages.<ref>{{cite journal |author=Kondrak, Grzegorz |title=Algorithms for Language Reconstruction |publisher=University of Toronto, Ontario |year=2002 |url=http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |accessdate=2007-01-21 |journal= |archive-url=https://web.archive.org/web/20081217043010/http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |archive-date=17 December 2008 |url-status=dead }}</ref> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<ref name=prinzie>{{cite journal|author1=Prinzie A. |author2=D. Van den Poel |year=2006 | url=http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm | title=Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM | journal=Decision Support Systems | volume=42 | issue=2| pages= 508–526 | doi=10.1016/j.dss.2005.02.004}} See also Prinzie and Van den Poel's paper {{cite journal | url=http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm | title=Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models | year=2007 | journal=Decision Support Systems | volume=44 | issue=1 | pages= 28–45 | doi=10.1016/j.dss.2007.02.008 | author=Prinzie, A | last2=Vandenpoel | first2=D}}</ref>
 
==Software==