Sequence alignment: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 09:31, 31 May 2025 edit Nemo bis (talk \| contribs) Extended confirmed users 39,751 edits Open access status updates in citations with OAbot #oabot Tag: OAbot [2.1] ← Previous edit		Latest revision as of 13:05, 25 August 2025 edit undo Pibabel (talk \| contribs) 26 edits →Maximal unique match: Deleted errant mid-sentence period. Tags: Mobile edit Mobile web edit
(3 intermediate revisions by 3 users not shown)
Line 97: ===Maximal unique match=== One way of quantifying the utility of a given pairwise alignment is the '[[maximal unique match]]' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness. <ref name="Alignment of whole genomes">{{cite journal \|last1=Delcher \|first1=A. L. \|last2=Kasif \|first2=S. \|last3=Fleishmann \|first3=R.D. \|last4=Peterson \|first4=J. \|last5=White \|first5=O. \|last6=Salzberg \|first6=S.L. \|title=Alignment of whole genomes \|journal=Nucleic Acids Research \|date=1999 \|volume=27 \|issue=11 \|pages=2369–2376 \|doi=10.1093/nar/30.11.2478 \|pmid=10325427\|pmc=148804 \|doi-access=free }}</ref> in the [[multiple sequence alignment]] of [[genomes]] in [[computational biology]]. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as [[MUMmer]]. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment. More precisely: Line 131: Word methods, also known as ''k''-tuple methods, are [[heuristic]] methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools [[FASTA]] and the [[BLAST (biotechnology)\|BLAST]] family.<ref name=mount/> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated. In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as [http://www.ebi.ac.uk/fasta33/ EMBL FASTA] and [https://web.archive.org/web/19970615060854/http://www.ncbi.nlm.nih.gov/BLAST/ NCBI BLAST]. ==Multiple sequence alignment== Line 139: ===Dynamic programming=== The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory\|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://archive.today/20120805063524/http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal \| journal=Proc Natl Acad Sci USA \| volume=86 \| pages=4412–5 \| year=1989 \|author1=Lipman DJ \|author2=Altschul SF \|author3=Kececioglu JD \| title=A tool for multiple sequence alignment \| pmid=2734293 \| doi=10.1073/pnas.86.12.4412 \| issue=12 \| pmc=287279 \| bibcode=1989PNAS...86.4412L \| doi-access=free }}</ref> ===Progressive methods=== Line 172: ===Combinatorial extension=== The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<ref name=shindyalov>{{cite journal \| journal=Protein Eng \| volume=11 \| pages=739–47 \| year=1998 \|author1=Shindyalov IN \|author2=Bourne PE. \| title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path \| pmid=9796821 \| doi = 10.1093/protein/11.9.739 \| issue=9 \| doi-access=~~free~~ }}</ref> Based on measures such as rigid-body [[Root mean square deviation (bioinformatics)\|root mean square distance]], residue distances, local secondary structure, and surrounding environmental features such as residue neighbor [[hydrophobic]]ity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the [https://web.archive.org/web/19981203071023/http://cl.sdsc.edu/ Combinatorial Extension] website. ==Phylogenetic analysis== Line 200: ==Non-biological uses== The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in [[Sequence analysis in social sciences\|social sciences]], where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal\|author1=Abbott A. \|author2=Tsay A. \| year=2000 \| title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect \| journal=Sociological Methods and Research \| volume=29\|issue=1 \| pages=3–33 \| doi=10.1177/0049124100029001001\|s2cid=121097811 }}</ref> Techniques that generate the set of elements from which words will be selected in [[natural language generation\|natural-language generation]] algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of [[automated theorem proving\|computer-generated mathematical proofs]].<ref name=Barzilay>{{cite book\|author1=Barzilay R \|author2=Lee L. \|title=Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP '02 \|chapter=Bootstrapping lexical choice via multiple-sequence alignment \|year=2002 \| pages=164–171 \| chapter-url=~~http~~https://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf\| volume=10\| doi=10.3115/1118693.1118715\|arxiv=cs/0205065\|bibcode=2002cs........5065B \|s2cid=7521453 }}</ref> In the field of historical and comparative [[linguistics]], sequence alignment has been used to partially automate the [[comparative method (linguistics)\|comparative method]] by which linguists traditionally reconstruct languages.<ref>{{cite thesis \|author=Kondrak, Grzegorz \|title=Algorithms for Language Reconstruction \|publisher=University of Toronto \|year=2002 \|url=http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf \|access-date=2007-01-21 \|archive-url=https://web.archive.org/web/20081217043010/http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf \|archive-date=17 December 2008 \|url-status=dead }}</ref> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<ref name=prinzie>{{cite journal\|author1=Prinzie A. \|author2=D. Van den Poel \|year=2006 \| url=http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm \| title=Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM \| journal=Decision Support Systems \| volume=42 \| issue=2\| pages= 508–526 \| doi=10.1016/j.dss.2005.02.004\| url-access=subscription }} See also Prinzie and Van den Poel's paper {{cite journal \| url=http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm \| title=Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models \| year=2007 \| journal=Decision Support Systems \| volume=44 \| issue=1 \| pages= 28–45 \| doi=10.1016/j.dss.2007.02.008 \| author=Prinzie, A \| last2=Vandenpoel \| first2=D\| url-access=subscription }}</ref> ==Software== {{Main\|Sequence alignment software}} A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web\|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/\|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI\|last=EMBL-EBI\|website=www.EBI.ac.uk\|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web\|url=~~http~~https://blast.ncbi.nlm.nih.gov/Blast.cgi\|title=BLAST: Basic Local Alignment Search Tool\|website=blast.ncbi.nlm.NIH.gov\|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web\|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml\|title=UVA FASTA Server\|website=fasta.bioch.Virginia.edu\|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR\|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry. Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)\|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal \| journal=Bioinformatics \| volume=15 \| pages=87–8 \| year=1999 \|author1=Thompson JD \|author2=Plewniak F \|author3=Poch O \| title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs \| pmid=10068696 \| doi = 10.1093/bioinformatics/15.1.87 \| issue=1 \| doi-access=free }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal \| journal=Nucleic Acids Res \| volume=27 \| pages=2682–90 \| year=1999 \|author1=Thompson JD \|author2=Plewniak F \|author3=Poch O. \| title=A comprehensive comparison of multiple sequence alignment programs \| url= \| pmid=10373585 \| doi = 10.1093/nar/27.13.2682 \| issue=13 \| pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web\|url=http://3d-alignment.eu/\|title=Multiple sequence alignment: Strap\|website=3d-alignment.eu\|access-date=12 June 2017}}</ref>