Citation bot (talk | contribs) m Alter: isbn, template type. Add: pmid, citeseerx. Removed parameters. You can use this bot yourself. Report bugs here. | User-activated; Category:Bioinformatics. |
Line 1:
{{Refimprove|date=March 2009}}
In [[bioinformatics]], a '''sequence alignment''' is a way of arranging the sequences of [[DNA]], [[RNA]], or protein to identify regions of similarity that may be a consequence of functional, [[structural biology|structural]], or [[evolution]]ary relationships between the sequences.<ref name=mount>{{cite book| author=Mount DM.| year=2004 | title=Bioinformatics: Sequence and Genome Analysis |edition=2nd | publisher= Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. |isbn=978-0-87969-608-
Sequence alignments are also used for non-biological sequences, for example to calculate the [[Edit distance|edit distance cost]] between strings in a [[natural language]] or in financial data.
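As an illustration of the edit distance mentioned above, the following is a minimal sketch of the Levenshtein distance between two strings. The function name <code>edit_distance</code> and the unit cost for every insertion, deletion and substitution are assumptions made for the example; practical applications may weight these operations differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost per insertion, deletion or substitution."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                 # delete char_a
                current[j - 1] + 1,              # insert char_b
                previous[j - 1] + substitution,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Only two rows of the dynamic programming table are kept in memory, so the sketch returns the distance but not the alignment itself.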
Line 74:
{{Main|Multiple sequence alignment}}
[[Image:Hemagglutinin-alignments.png|right|thumb|300px|Alignment of 27 [[avian influenza]] [[hemagglutinin]] protein sequences colored by residue conservation (top) and residue properties (bottom)]]
[[Multiple sequence alignment]] is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying [[conservation (genetics)|conserved]] sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and [[reaction mechanism|mechanistic]] information to locate the catalytic [[active site]]s of [[enzyme]]s. Alignments are also used to aid in establishing evolutionary relationships by constructing [[phylogenetic tree]]s. Multiple sequence alignments are computationally difficult to produce, and most formulations of the problem lead to [[NP-complete]] combinatorial optimization problems.<ref name=wang>{{cite journal | journal=J Comput Biol | volume=1 | pages=337–48 | year=1994 |author1=Wang L |author2=Jiang T. | title=On the complexity of multiple sequence alignment
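The identification of conserved regions described above can be sketched on an alignment that has already been computed. The example below is a simplified illustration and not the method of any particular tool: the helper names, the gap character <code>-</code>, the toy sequences and the scoring rule (the fraction of rows sharing the most common residue in a column) are all assumptions made for the example.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per column, the fraction of rows that share the most common non-gap residue."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        most_common = counts.most_common(1)[0][1] if counts else 0
        scores.append(most_common / len(column))
    return scores


def conserved_columns(alignment, threshold=1.0):
    """Indices of columns whose conservation score reaches the threshold."""
    scores = column_conservation(alignment)
    return [index for index, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    toy_alignment = ["MKV-LHG", "MKVALHG", "MRV-LHG"]  # hypothetical aligned rows
    print(conserved_columns(toy_alignment))
```

Gaps count against a column because the denominator is the number of rows, so a column that is mostly gaps is never reported as conserved.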
===Dynamic programming===
Line 114:
Phylogenetics and sequence alignment are closely related fields because both depend on evaluating sequence relatedness.<ref name=ortet>{{cite journal|author1=Ortet P |author2=Bastien O | year=2010 | title=Where Does the Alignment Score Distribution Shape Come from? | journal= Evolutionary Bioinformatics | volume=6| pages=159–187| pmid = 21258650| doi = 10.4137/EBO.S5875 | url=http://www.la-press.com/where-does-the-alignment-score-distribution-shape-come-from-article-a2393| pmc=3023300}}</ref> The field of [[phylogenetics]] makes extensive use of sequence alignments in the construction and interpretation of [[phylogenetic tree]]s, which are used to classify the evolutionary relationships between homologous [[gene]]s represented in the [[genome]]s of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young [[most recent common ancestor]], while low identity suggests that the divergence is more ancient. This approximation, which reflects the "[[molecular clock]]" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the [[coalescence (genetics)|coalescence]] time), assumes that the effects of mutation and [[natural selection|selection]] are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of [[DNA repair]] or the possible functional conservation of specific regions in a sequence.
(In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between [[silent mutation]]s that do not alter the meaning of a given [[codon]] and other mutations that result in a different [[amino acid]] being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
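The rough relationship between sequence identity and divergence time under a strict molecular clock can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the distance is the uncorrected proportion of differing sites (no correction for multiple substitutions at the same site), and the substitution rate is assumed to be constant and identical in both lineages, which is exactly the simplification criticised above.

```python
def sequence_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x == y for x, y in pairs) / len(pairs)


def clock_divergence_time(identity: float, rate: float) -> float:
    """Strict-clock estimate t = d / (2 * rate).

    d is the uncorrected distance (1 - identity); the factor 2 reflects that
    both lineages accumulate changes after they split. `rate` is an assumed
    number of substitutions per site per unit time.
    """
    distance = 1.0 - identity
    return distance / (2.0 * rate)


if __name__ == "__main__":
    identity = sequence_identity("ACGTACGTAC", "ACGTACGTAA")  # toy sequences
    print(identity, clock_divergence_time(identity, rate=1e-9))  # rate is illustrative
```

Methods that let the rate vary between branches replace the single <code>rate</code> parameter with branch-specific estimates.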
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble both multiple sequence alignments and phylogenetic trees first score and sort trees, then calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly [[heuristic]] because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is [[NP-hard]].<ref name=felsenstein>{{cite book| author=Felsenstein J. | year=2004| title=Inferring Phylogenies | publisher= Sinauer Associates: Sunderland, MA | isbn=978-0-87893-177-
===Assessment of significance===
Line 121:
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
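The dependence on the search space can be illustrated with the Karlin-Altschul expectation formula, <code>E = K · m · n · exp(−λ · S)</code>, which gives the number of alignments with score at least <code>S</code> expected by chance for a query of length <code>m</code> and a database of total length <code>n</code>. The sketch below is not BLAST's implementation: the function name is hypothetical, the default values of <code>lam</code> and <code>k</code> are illustrative placeholders that in practice depend on the scoring matrix and gap penalties, and real searches use length-corrected "effective" values for <code>m</code> and <code>n</code>.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least `score`.

    Implements E = K * m * n * exp(-lambda * S). The defaults for `lam` and `k`
    are placeholders for illustration, not values taken from the article.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    for db_len in (1_000_000, 100_000_000):
        # Same raw score, larger database: proportionally more chance hits.
        print(db_len, expected_chance_hits(score=80, query_len=250, db_len=db_len))
```

Because the expectation grows linearly with database length, the same raw score is less significant in a larger search space.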
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite
journal= Phys. Rev. E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102|
===Assessment of credibility===
Line 133:
==Other biological uses==
Sequenced RNA, such as [[expressed sequence tags]] and full-length mRNAs, can be aligned to a sequenced genome to locate genes and obtain information about [[alternative splicing]]<ref>{{cite
==Non-biological uses==
Line 140:
==Software==
{{Main|Sequence alignment software}}
A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | url=http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10068696 | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. | title=A comprehensive comparison of multiple sequence alignment programs | url=http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10373585 | pmid=10373585 | doi = 10.1093/nar/27.13.2682 | issue=13 | pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web|url=http://3d-alignment.eu/|title=Multiple sequence alignment: Strap|author=|date=|website=3d-alignment.eu|access-date=12 June 2017}}</ref>