Revision as of 01:20, 12 November 2018 by Kunigami (→Word methods)		Revision as of 02:09, 28 November 2018 by Citation bot (m Alter: isbn, template type. Add: pmid, citeseerx. Removed parameters. User-activated; Category:Bioinformatics.)
Line 1: {{Refimprove\|date=March 2009}} In [[bioinformatics]], a '''sequence alignment''' is a way of arranging the sequences of [[DNA]], [[RNA]], or protein to identify regions of similarity that may be a consequence of functional, [[structural biology\|structural]], or [[evolution]]ary relationships between the sequences.<ref name=mount>{{cite book\| author=Mount DM.\| year=2004 \| title=Bioinformatics: Sequence and Genome Analysis \|edition=2nd \| publisher= Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. \|isbn=978-0-87969-608-~~7~~5}}</ref> Aligned sequences of [[nucleotide]] or [[amino acid]] residues are typically represented as rows within a [[matrix (mathematics)\|matrix]]. Gaps are inserted between the [[Residue (chemistry)\|residues]] so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the [[Edit distance\|edit distance cost]] between strings in a [[natural language]] or in financial data. Line 74: {{Main\|Multiple sequence alignment}} [[Image:Hemagglutinin-alignments.png\|right\|thumb\|300px\|Alignment of 27 [[avian influenza]] [[hemagglutinin]] protein sequences colored by residue conservation (top) and residue properties (bottom)]] [[Multiple sequence alignment]] is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying [[conservation (genetics)\|conserved]] sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and [[reaction mechanism\|mechanistic]] information to locate the catalytic [[active site]]s of [[enzyme]]s. Alignments are also used to aid in establishing evolutionary relationships by constructing [[phylogenetic tree]]s.
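As a rough illustration of how conserved regions can be read off a multiple alignment stored as matrix rows, the following minimal sketch scores each column by the fraction of rows that share its most common residue. The function name `column_conservation`, the gap symbol, and the toy sequences are assumptions made for this example; the alignment tools discussed in the article use more elaborate conservation measures.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Return, for each column, the fraction of rows sharing the most common residue.

    `alignment` is a list of equal-length strings (one aligned sequence per row).
    Gap characters never count as the conserved residue, but they do count
    in the denominator, so gappy columns score low.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [ch for ch in column if ch != gap]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


if __name__ == "__main__":
    # Hypothetical toy alignment, not taken from any real data set.
    toy = ["ACG-TA", "ACGTTA", "ACC-TA"]
    scores = column_conservation(toy)
    fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
    print(scores)
    print(fully_conserved)
```

Columns with a score of 1.0 are candidates for conserved positions. In practice such a score would be combined with structural or mechanistic information, as the text describes.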
Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to [[NP-complete]] combinatorial optimization problems.<ref name=wang>{{cite journal \| journal=J Comput Biol \| volume=1 \| pages=337–48 \| year=1994 \|author1=Wang L \|author2=Jiang T. \| title=On the complexity of multiple sequence alignment ~~\| url=http://www.liebertonline.com/doi/abs/10.1089/cmb.1994.1.337~~ \| pmid=8790475 \| doi = 10.1089/cmb.1994.1.337\| issue=4 \| citeseerx=10.1.1.408.894 }}</ref><ref name=elias>{{cite journal \| journal=J Comput Biol \| volume=13 \| pages=1323–1339 \| year=2006 \| author=Elias, Isaac \| title=Settling the intractability of multiple alignment ~~\| url=http://www.liebertonline.com/doi/abs/10.1089/cmb.2006.13.1323~~ \| pmid=17037961 \| doi =10.1089/cmb.2006.13.1323 \| issue=7 \| citeseerx=10.1.1.6.256 }}</ref> Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences. ===Dynamic programming=== Line 114: Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.<ref name=ortet>{{cite journal\|author1=Ortet P \|author2=Bastien O \| year=2010 \| title=Where Does the Alignment Score Distribution Shape Come from? \| journal= Evolutionary Bioinformatics \| volume=6\| pages=159–187\| pmid = 21258650\| doi = 10.4137/EBO.S5875 \| url=http://www.la-press.com/where-does-the-alignment-score-distribution-shape-come-from-article-a2393\| pmc=3023300}}</ref> The field of [[phylogenetics]] makes extensive use of sequence alignments in the construction and interpretation of [[phylogenetic tree]]s, which are used to classify the evolutionary relationships between homologous [[gene]]s represented in the [[genome]]s of divergent species. 
The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young [[most recent common ancestor]], while low identity suggests that the divergence is more ancient. This approximation, which reflects the "[[molecular clock]]" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the [[coalescence (genetics)\|coalescence]] time), assumes that the effects of mutation and [[natural selection\|selection]] are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of [[DNA repair]] or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between [[silent mutation]]s that do not alter the meaning of a given [[codon]] and other mutations that result in a different [[amino acid]] being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes. Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble both multiple sequence alignments and phylogenetic trees score and sort the trees first, then calculate a multiple sequence alignment from the highest-scoring tree.
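The relationship between sequence identity and evolutionary distance can be sketched numerically. The example below computes the proportion of differing sites between two aligned nucleotide sequences and applies the Jukes–Cantor correction, d = −(3/4) ln(1 − 4p/3). The Jukes–Cantor model is not named in the text above; it is used here only as one simple, commonly taught model that assumes equal substitution rates among all nucleotides, and the function names and toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site).

    Only defined for 0 <= p < 0.75; beyond that the sequences are saturated.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Hypothetical aligned sequences differing at one of eight sites.
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the raw proportion because it allows for multiple substitutions at the same site. Converting it to an elapsed time would further require an assumed substitution rate, which is the molecular clock assumption the paragraph cautions about.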
Commonly used methods of phylogenetic tree construction are mainly [[heuristic]] because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is [[NP-hard]].<ref name=felsenstein>{{cite book\| author=Felsenstein J. \| year=2004\| title=Inferring Phylogenies \| publisher= Sinauer Associates: Sunderland, MA \| isbn=978-0-87893-177-~~5~~4}}</ref> ===Assessment of significance=== Line 121: In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts. Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite ~~journal~~book\|author1=Altschul SF \|author2=Gish W \| year=1996\| title=Local Alignment Statistics\| journal= Meth.Enz. \| volume=266 \| pages = 460–480\|doi=10.1016/S0076-6879(96)66029-7\|series=Methods in Enzymology\|isbn=9780121821678}}</ref><ref name=hartmann>{{cite journal\| author=Hartmann AK\| year=2002\| title=Sampling rare events: statistics of local sequence alignments\| journal= Phys. Rev. 
E\| volume=65\| page=056102\|doi=10.1103/PhysRevE.65.056102\|~~url~~ pmid=~~http://link.aps.org/doi/10.1103/PhysRevE.65.056102~~12059642\| issue=5\|arxiv=cond-mat/0108201\|bibcode=2002PhRvE..65e6102H}}</ref><ref name=newberg>{{cite journal\| author=Newberg LA \| year=2008 \| title=Significance of gapped sequence alignments \| journal= J Comput Biol \| volume=15\| pages=1187–1194 \| pmid = 18973434 \| doi=10.1089/cmb.2008.0125\| nopp=true\| issue=9\| pmc=2737730}}</ref><ref name=eddy>{{cite journal\| author=Eddy SR\| year=2008 \| title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation \| journal= PLoS Comput Biol \| volume=4\| editor1-first=Burkhard\| pages=e1000069 \| pmid = 18516236\| editor1-last=Rost \| doi=10.1371/journal.pcbi.1000069\| issue=5\| pmc=2396288\| last2=Rost\| first2=Burkhard\| bibcode=2008PLSCB...4E0069E}}</ref><ref name=bastien>{{cite journal\|author1=Bastien O \|author2=Aude JC \|author3=Roy S \|author4=Marechal E \| year=2004 \| title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics \| journal= Bioinformatics \| volume=20\| issue=4\| pages=534–537\| pmid = 14990449\| doi = 10.1093/bioinformatics/btg440 \| url=http://bioinformatics.oxfordjournals.org/content/20/4/534.long}}</ref><ref name=agrawal11>{{cite journal\|author1=Agrawal A \|author2=Huang X \| year=2011\| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices\|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics\| volume=8\| pages=194–205\|doi=10.1109/TCBB.2009.69\|pmid=21071807 \|url=http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793\|archive-url=https://archive.is/20130415004914/http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793\|dead-url=yes\|archive-date=2013-04-15\| issue=1}}</ref><ref name=agrawal08>{{cite journal\| author1=Agrawal A\| 
author2=Brendel VP\| author3=Huang X\| year=2008\| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment\| journal=International Journal of Computational Biology and Drug Design\| volume=1\| pages=347–367\| doi=10.1504/IJCBDD.2008.022207\| url=http://inderscience.metapress.com/content/1558538106522500/\| issue=4\| deadurl=yes\| archiveurl=https://archive.is/20130128163812/http://inderscience.metapress.com/content/1558538106522500/\| archivedate=28 January 2013\| df=dmy-all}}</ref> ===Assessment of credibility=== Line 133: ==Other biological uses== Sequenced RNA, such as [[expressed sequence tags]] and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about [[alternative splicing]]<ref>{{cite ~~journal~~book \|author1=Kim N \|author2=Lee C \|title=Bioinformatics detection of alternative splicing \|journal=Methods Mol. Biol. \|volume=452 \|issue= \|pages=179–97 \|year=2008 \|pmid=18566765 \|doi=10.1007/978-1-60327-159-2_9 \|series=Methods in Molecular Biology™ \|isbn=978-1-58829-707-5}}</ref> and [[RNA editing]].<ref>{{cite journal \|vauthors=Li JB, Levanon EY, Yoon JK, etal \|title=Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing \|journal=Science \|volume=324 \|issue=5931 \|pages=1210–3 \|date=May 2009 \|pmid=19478186 \|doi=10.1126/science.1170995\|bibcode=2009Sci...324.1210L }}</ref> Sequence alignment is also a part of [[genome assembly]], where sequences are aligned to find overlap so that ''[[contig]]s'' (long stretches of sequence) can be formed.<ref>{{cite journal \|vauthors=Blazewicz J, Bryja M, Figlerowicz M, etal \|title=Whole genome assembly from 454 sequencing output via modified DNA graph concept \|journal=Comput Biol Chem \|volume=33 \|issue=3 \|pages=224–30 \|date=June 2009 \|pmid=19477687 \|doi=10.1016/j.compbiolchem.2009.04.005}}</ref> Another use is [[single 
nucleotide polymorphism\|SNP]] analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.<ref>{{cite journal \|author1=Duran C \|author2=Appleby N \|author3=Vardy M \|author4=Imelfort M \|author5=Edwards D \|author6=Batley J \|title=Single nucleotide polymorphism discovery in barley using autoSNPdb \|journal=Plant Biotechnol. J. \|volume=7 \|issue=4 \|pages=326–33 \|date=May 2009 \|pmid=19386041 \|doi=10.1111/j.1467-7652.2009.00407.x }}</ref> ==Non-biological uses== Line 140: ==Software== {{Main\|Sequence alignment software}} A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web\|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/\|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI~~\|first=~~\|last=EMBL-EBI\|date=\|website=www.EBI.ac.uk\|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web\|url=http://blast.ncbi.nlm.nih.gov/Blast.cgi\|title=BLAST: Basic Local Alignment Search Tool\|author=\|date=\|website=blast.ncbi.nlm.NIH.gov\|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web\|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml\|title=UVA FASTA Server\|author=\|date=\|website=fasta.bioch.Virginia.edu\|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR\|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry. 
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)\|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal \| journal=Bioinformatics \| volume=15 \| pages=87–8 \| year=1999 \|author1=Thompson JD \|author2=Plewniak F \|author3=Poch O \| title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs \| url=http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10068696 \| pmid=10068696 \| doi = 10.1093/bioinformatics/15.1.87 \| issue=1 }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal \| journal=Nucleic Acids Res \| volume=27 \| pages=2682–90 \| year=1999 \|author1=Thompson JD \|author2=Plewniak F \|author3=Poch O. \| title=A comprehensive comparison of multiple sequence alignment programs \| url=http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10373585 \| pmid=10373585 \| doi = 10.1093/nar/27.13.2682 \| issue=13 \| pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web\|url=http://3d-alignment.eu/\|title=Multiple sequence alignment: Strap\|author=\|date=\|website=3d-alignment.eu\|access-date=12 June 2017}}</ref>
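The idea of scoring a sequence-based alignment against a structure-derived reference can be illustrated with a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The following is a minimal sketch under stated assumptions, not the BAliBASE scoring program itself, which may restrict scoring to particular regions or use other measures. The function names and the toy alignments are hypothetical, and both alignments are assumed to contain the same sequences in the same row order.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Collect the residue pairs that an alignment places in the same column.

    Each residue is identified by (row index, position in the ungapped sequence).
    """
    width = len(alignment[0]) if alignment else 0
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    next_position = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, ch in enumerate(column):
            if ch != gap:
                present.append((row, next_position[row]))
                next_position[row] += 1
        # Every pair of residues sharing this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference, gap="-"):
    """Fraction of reference-aligned residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference, gap)
    if not reference_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(reference_pairs & aligned_pairs(test, gap)) / len(reference_pairs)


if __name__ == "__main__":
    # Hypothetical two-sequence example; the test alignment misplaces one gap.
    reference = ["AC-GT", "ACTGT"]
    test = ["ACG-T", "ACTGT"]
    print(sum_of_pairs_score(test, reference))
```

A score of 1.0 means the test alignment reproduces every aligned pair in the reference; lower values indicate residues placed in different columns than the reference places them.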

Sequence alignment: Difference between revisions