Sequence alignment: Difference between revisions

Content deleted Content added
OAbot (talk | contribs)
m Open access bot: url-access updated in citation with #oabot.
Pibabel (talk | contribs)
Maximal unique match: Deleted errant mid-sentence period.
Tags: Mobile edit Mobile web edit
 
(4 intermediate revisions by 4 users not shown)
Line 89:
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the [[Needleman–Wunsch algorithm]], which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The [[Smith–Waterman algorithm]] is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.<ref name="Polyanovsky2011"/>
 
Hybrid methods, known as semi-global or "glocal" (short for '''glo'''bal-lo'''cal''') methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.<ref name=brudno>{{cite journal|author1=Brudno M |author2=Malde S |author3=Poliakov A |author4=Do CB |author5=Couronne O |author6=Dubchak I |author7=Batzoglou S | year=2003 | title=Glocal alignment: finding rearrangements during alignment | journal= Bioinformatics | volume=Suppl 1| issue=90001| pages=i54–62| series=19 | pmid = 12855437| doi = 10.1093/bioinformatics/btg1005 | doi-access=free }}</ref> Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.
 
Fast expansion of genetic data challenges speed of current DNA sequence alignment algorithms. Essential needs for an efficient and accurate method for DNA variant discovery demand innovative approaches for parallel processing in real time. [[Optical computing]] approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested [https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201900227].
Line 97:
 
===Maximal unique match===
One way of quantifying the utility of a given pairwise alignment is the '[[maximal unique match]]' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness. <ref name="Alignment of whole genomes">{{cite journal |last1=Delcher |first1=A. L. |last2=Kasif |first2=S. |last3=Fleishmann |first3=R.D. |last4=Peterson |first4=J. |last5=White |first5=O. |last6=Salzberg |first6=S.L. |title=Alignment of whole genomes |journal=Nucleic Acids Research |date=1999 |volume=27 |issue=11 |pages=2369–2376 |doi=10.1093/nar/30.11.2478 |pmid=10325427|pmc=148804 |doi-access=free }}</ref> in the [[multiple sequence alignment]] of [[genomes]] in [[computational biology]]. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as [[MUMmer]]. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.
 
More precisely:
Line 131:
Word methods, also known as ''k''-tuple methods, are [[heuristic]] methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools [[FASTA]] and the [[BLAST (biotechnology)|BLAST]] family.<ref name=mount/> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
 
In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as [http://www.ebi.ac.uk/fasta33/ EMBL FASTA] and [https://web.archive.org/web/19970615060854/http://www.ncbi.nlm.nih.gov/BLAST/ NCBI BLAST].
 
==Multiple sequence alignment==
Line 139:
 
===Dynamic programming===
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://archive.today/20120805063524/http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal | journal=Proc Natl Acad Sci USA | volume=86 | pages=4412–5 | year=1989 |author1=Lipman DJ |author2=Altschul SF |author3=Kececioglu JD | title=A tool for multiple sequence alignment | pmid=2734293 | doi=10.1073/pnas.86.12.4412 | issue=12 | pmc=287279 | bibcode=1989PNAS...86.4412L | doi-access=free }}</ref>
 
===Progressive methods===
Line 172:
 
===Combinatorial extension===
The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<ref name=shindyalov>{{cite journal | journal=Protein Eng | volume=11 | pages=739–47 | year=1998 |author1=Shindyalov IN |author2=Bourne PE. | title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path | pmid=9796821 | doi = 10.1093/protein/11.9.739 | issue=9 | doi-access=free }}</ref> Based on measures such as rigid-body [[Root mean square deviation (bioinformatics)|root mean square distance]], residue distances, local secondary structure, and surrounding environmental features such as residue neighbor [[hydrophobic]]ity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the [https://web.archive.org/web/19981203071023/http://cl.sdsc.edu/ Combinatorial Extension] website.
 
==Phylogenetic analysis==
Line 200:
 
==Non-biological uses==
The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in [[Sequence analysis in social sciences|social sciences]], where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal|author1=Abbott A. |author2=Tsay A. | year=2000 | title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect | journal=Sociological Methods and Research | volume=29|issue=1 | pages=3–33 | doi=10.1177/0049124100029001001|s2cid=121097811 }}</ref> Techniques that generate the set of elements from which words will be selected in [[natural language generation|natural-language generation]] algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of [[automated theorem proving|computer-generated mathematical proofs]].<ref name=Barzilay>{{cite book|author1=Barzilay R |author2=Lee L. |title=Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP '02 |chapter=Bootstrapping lexical choice via multiple-sequence alignment |year=2002 | pages=164–171 | chapter-url=httphttps://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf| volume=10| doi=10.3115/1118693.1118715|arxiv=cs/0205065|bibcode=2002cs........5065B |s2cid=7521453 }}</ref> In the field of historical and comparative [[linguistics]], sequence alignment has been used to partially automate the [[comparative method (linguistics)|comparative method]] by which linguists traditionally reconstruct languages.<ref>{{cite thesis |author=Kondrak, Grzegorz |title=Algorithms for Language Reconstruction |publisher=University of Toronto |year=2002 |url=http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |access-date=2007-01-21 |archive-url=https://web.archive.org/web/20081217043010/http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf |archive-date=17 December 2008 |url-status=dead }}</ref> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<ref name=prinzie>{{cite journal|author1=Prinzie A. |author2=D. Van den Poel |year=2006 | url=http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm | title=Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM | journal=Decision Support Systems | volume=42 | issue=2| pages= 508–526 | doi=10.1016/j.dss.2005.02.004| url-access=subscription }} See also Prinzie and Van den Poel's paper {{cite journal | url=http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm | title=Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models | year=2007 | journal=Decision Support Systems | volume=44 | issue=1 | pages= 28–45 | doi=10.1016/j.dss.2007.02.008 | author=Prinzie, A | last2=Vandenpoel | first2=D| url-access=subscription }}</ref>
 
==Software==
{{Main|Sequence alignment software}}
A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI|last=EMBL-EBI|website=www.EBI.ac.uk|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web|url=httphttps://blast.ncbi.nlm.nih.gov/Blast.cgi|title=BLAST: Basic Local Alignment Search Tool|website=blast.ncbi.nlm.NIH.gov|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml|title=UVA FASTA Server|website=fasta.bioch.Virginia.edu|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry.
 
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 | doi-access=free }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. | title=A comprehensive comparison of multiple sequence alignment programs | url= | pmid=10373585 | doi = 10.1093/nar/27.13.2682 | issue=13 | pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web|url=http://3d-alignment.eu/|title=Multiple sequence alignment: Strap|website=3d-alignment.eu|access-date=12 June 2017}}</ref>