rezaty lahiv kak i rezaty tebe David Eppstein (talk) |
→Maximal unique match: Deleted errant mid-sentence period.
(17 intermediate revisions by 15 users not shown)
Line 13:
==Alignment methods==
Very short or very similar sequences can be aligned by hand.
| pmid = 22032267
| year = 2011
Line 53:
The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches or mismatches rather than the older 'M' character, which is ambiguous.
{| class="wikitable"
! CIGAR Code
! BAM Integer
! Description
! Consumes query
! Consumes reference
|-
| M||0||alignment match (can be a sequence match or mismatch)||yes||yes
Line 80:
|
|}
*
* H can only be present as the first and/or last operation.
* S may only have H operations between them and the ends of the CIGAR string.
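The CIGAR rules above can be illustrated with a short Python sketch. The operation table (BAM integers and which operations consume query or reference bases) follows the SAM specification; the function names <code>parse_cigar</code>, <code>clipping_is_valid</code> and <code>consumed_lengths</code> are hypothetical and exist only for this illustration.

```python
import re

# Operation properties as listed in the SAM specification:
# op -> (BAM integer, consumes query, consumes reference)
CIGAR_OPS = {
    "M": (0, True, True),
    "I": (1, True, False),
    "D": (2, False, True),
    "N": (3, False, True),
    "S": (4, True, False),
    "H": (5, False, False),
    "P": (6, False, False),
    "=": (7, True, True),
    "X": (8, True, True),
}


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; reject malformed input."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def clipping_is_valid(ops):
    """Check the two clipping rules: H only at the ends, S only next to them."""
    codes = [op for _, op in ops]
    # H can only be present as the first and/or last operation.
    if "H" in codes[1:-1]:
        return False
    # Drop terminal H operations; S may then only sit at the ends of the rest.
    if codes and codes[0] == "H":
        codes = codes[1:]
    if codes and codes[-1] == "H":
        codes = codes[:-1]
    return "S" not in codes[1:-1]


def consumed_lengths(ops):
    """Return (query bases consumed, reference bases consumed)."""
    query = sum(n for n, op in ops if CIGAR_OPS[op][1])
    reference = sum(n for n, op in ops if CIGAR_OPS[op][2])
    return query, reference


ops = parse_cigar("3S8M1I4M2D5M")
print(consumed_lengths(ops))  # (21, 19)
```

Because 'M' covers both matches and mismatches, a tool that needs the number of mismatches cannot obtain it from an 'M'-only CIGAR string, which is why '=' and 'X' are preferred where available.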
Line 89:
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the [[Needleman–Wunsch algorithm]], which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The [[Smith–Waterman algorithm]] is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.<ref name="Polyanovsky2011"/>
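The relationship between the two methods can be seen in a minimal score-only sketch. It assumes a simple scoring scheme (match, mismatch and a linear gap cost chosen only for illustration), and the function name <code>alignment_score</code> is hypothetical. The only differences between the global and local variants are the border initialisation and the two extra choices marked in the comments.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score under a linear gap cost.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    """
    # Row 0: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                        prev[j] + gap,       # a[i-1] against a gap
                        curr[j - 1] + gap)   # b[j-1] against a gap
            if local:
                score = max(score, 0)    # extra choice: start a new alignment here
                best = max(best, score)  # extra choice: end the alignment anywhere
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


a = "TTTGATTACATTT"
b = "CCCGATTACACCC"
print(alignment_score(a, b, local=True))   # 7, the shared GATTACA region
print(alignment_score(a, b, local=False))  # lower, the dissimilar flanks are penalised
```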
Hybrid methods, known as semi-global or "glocal" (short for '''glo'''bal-lo'''cal''') methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.<ref name=brudno>{{cite journal|author1=Brudno M |author2=Malde S |author3=Poliakov A |author4=Do CB |author5=Couronne O |author6=Dubchak I |author7=Batzoglou S | year=2003 | title=Glocal alignment: finding rearrangements during alignment | journal= Bioinformatics | volume=Suppl 1| issue=90001| pages=i54–62| series=19 | pmid = 12855437| doi = 10.1093/bioinformatics/btg1005 | doi-access=
The rapid growth of genetic data challenges the speed of current DNA sequence alignment algorithms. The need for efficient and accurate DNA variant discovery demands innovative approaches to real-time parallel processing. [[Optical computing]] approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested.[https://onlinelibrary.wiley.com/doi/abs/10.1002/jbio.201900227]
Line 97:
===Maximal unique match===
One way of quantifying the utility of a given pairwise alignment is the '[[maximal unique match]]' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness
More precisely:
Line 122:
===Dynamic programming===
The technique of [[dynamic programming]] can be applied to produce global alignments via the [[Needleman–Wunsch algorithm]], and local alignments via the [[Smith–Waterman algorithm]]. In typical usage, protein alignments use a [[substitution matrix]] to assign scores to amino-acid matches or mismatches, and a [[gap penalty]] for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each sequence position is independent of the identity of its neighbors, and therefore [[base stacking]] effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.){{citation needed|date=April 2024}}
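As an illustration of the simple DNA scoring scheme described above (a positive match score, a negative mismatch score and a negative gap penalty), the following Python sketch fills the dynamic programming matrix and traces back one optimal global alignment. The default scores and the function name <code>needleman_wunsch</code> are assumptions made for this example; a protein alignment would look the substitution score up in a substitution matrix instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Borders: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill: each cell depends only on its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner, preferring the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

When several alignments share the optimal score, this sketch returns only one of them; which one depends on the order in which the traceback tests the three moves.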
A common extension to standard linear gap costs is the use of affine gap costs, in which two different penalties are applied for opening a gap and for extending it. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. This results in fewer, longer gaps in an alignment, so that residues and gaps are each kept together in contiguous runs, a trait more representative of biological sequences. The Gotoh algorithm implements affine gap costs by using three matrices.<ref>{{Cite journal |last=Gotoh |first=Osamu |date=1982-12-15 |title=An improved algorithm for matching biological sequences |url=https://linkinghub.elsevier.com/retrieve/pii/0022283682903989 |journal=Journal of Molecular Biology |volume=162 |issue=3 |pages=705–708 |doi=10.1016/0022-2836(82)90398-9 |pmid=7166760 |issn=0022-2836|url-access=subscription }}</ref><ref>{{Cite journal |last=Gotoh |first=Osamu |date=1999-01-01 |title=Multiple sequence alignment: Algorithms and applications |url=https://linkinghub.elsevier.com/retrieve/pii/S0065227X99800070 |journal=Advances in Biophysics |volume=36 |pages=159–206 |doi=10.1016/S0065-227X(99)80007-0 |pmid=10463075 |issn=0065-227X|url-access=subscription }}</ref>
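A three-matrix recurrence of this kind can be sketched as follows. This is a score-only illustration, not Gotoh's published formulation verbatim: it assumes that the first position of a gap costs <code>gap_open</code> and every further position costs <code>gap_extend</code> (other conventions exist), and the match and mismatch scores and the name <code>gotoh_score</code> are chosen only for the example.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=5, mismatch=-4, gap_open=-10, gap_extend=-2):
    """Global alignment score with affine gap costs, using three matrices."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Extending an existing gap is cheaper than opening a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Six matches (6 * 5) and one gap of length three (-10 - 2 - 2).
print(gotoh_score("AAATTTGGG", "AAAGGG"))  # 16
```

Setting <code>gap_extend</code> equal to <code>gap_open</code> reduces the scheme to a linear gap cost, under which the same pair of sequences scores 0 because each of the three gap positions is charged in full.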
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account [[frameshift]] mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The [[BLAST (biotechnology)|BLAST]] and [[EMBOSS]] suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from [[open-source software]] such as [http://www.ebi.ac.uk/Tools/psa/genewise/ GeneWise].{{citation needed|date=April 2024}}
Line 132 ⟶ 131:
Word methods, also known as ''k''-tuple methods, are [[heuristic]] methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools [[FASTA]] and the [[BLAST (biotechnology)|BLAST]] family.<ref name=mount/> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
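The offset idea can be illustrated with a minimal Python sketch. It is not the FASTA or BLAST implementation; the function name <code>word_offsets</code> and the word length <code>k=3</code> are assumptions for this example. Words from the query are looked up in an index of the target sequence, and each hit votes for the offset between the two positions; an offset supported by several distinct words marks a candidate region for more sensitive alignment.

```python
from collections import Counter, defaultdict


def word_offsets(query, target, k=3):
    """Count, for each offset, how many query words hit the target at that offset."""
    # Index every k-letter word of the target by its start positions.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    # Take nonoverlapping words from the query and record position differences.
    offsets = Counter()
    for i in range(0, len(query) - k + 1, k):
        for j in index.get(query[i:i + k], ()):
            offsets[j - i] += 1
    return offsets


query = "ACGTGCATTGCA"
target = "TTTT" + query + "GGGG"
print(word_offsets(query, target).most_common(1))  # [(4, 4)]
```

In this example all four query words agree on offset 4, while the chance hits at other offsets are each supported by a single word and would be discarded.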
In the FASTA method, the user defines a value ''k'' to use as the word length with which to search the database. The method is slower but more sensitive at lower values of ''k'', which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length ''k'', but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as [http://www.ebi.ac.uk/fasta33/ EMBL FASTA] and [https://web.archive.org/web/19970615060854/http://www.ncbi.nlm.nih.gov/BLAST/ NCBI BLAST].
==Multiple sequence alignment==
Line 140 ⟶ 139:
===Dynamic programming===
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and [[computer memory|memory]], it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the ''n''-dimensional equivalent of the sequence matrix formed from two sequences, where ''n'' is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" [[objective function]], has been implemented in the [https://archive.today/20120805063524/http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html MSA] software package.<ref name=lipman>{{cite journal | journal=Proc Natl Acad Sci USA | volume=86 | pages=4412–5 | year=1989 |author1=Lipman DJ |author2=Altschul SF |author3=Kececioglu JD | title=A tool for multiple sequence alignment | pmid=2734293 | doi=10.1073/pnas.86.12.4412 | issue=12 | pmc=287279 | bibcode=1989PNAS...86.4412L | doi-access=free }}</ref>
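Two aspects of this paragraph can be made concrete in a short Python sketch: the growth of the ''n''-dimensional matrix, and a "sum of pairs" score for a finished alignment. The scoring values and the names <code>dp_cells</code> and <code>sum_of_pairs</code> are assumptions for illustration, and the sketch does not reproduce the pruning strategy of the MSA package.

```python
from itertools import combinations
from math import prod


def dp_cells(lengths):
    """Number of cells in the full n-dimensional dynamic programming matrix."""
    return prod(length + 1 for length in lengths)


def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of every pair of rows in a multiple alignment."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for x, y in combinations(rows, 2):
        for p, q in zip(x, y):
            if p == "-" and q == "-":
                continue  # a column that is a gap in both rows contributes nothing
            if p == "-" or q == "-":
                total += gap
            elif p == q:
                total += match
            else:
                total += mismatch
    return total


print(dp_cells([100, 100]))       # 10201 cells for two sequences
print(dp_cells([100, 100, 100]))  # 1030301 cells for three
print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))  # 0
```

Each additional sequence of length 100 multiplies the number of cells by 101, which is why the basic method is rarely practical beyond three or four sequences.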
===Progressive methods===
Line 173 ⟶ 172:
===Combinatorial extension===
The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<ref name=shindyalov>{{cite journal | journal=Protein Eng | volume=11 | pages=739–47 | year=1998 |author1=Shindyalov IN |author2=Bourne PE. | title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path | pmid=9796821 | doi = 10.1093/protein/11.9.739 | issue=9 | doi-access=
==Phylogenetic analysis==
Line 186 ⟶ 185:
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<ref name="ortet"/><ref name=altschul>{{cite book|author1=Altschul SF |author2=Gish W |chapter=Local alignment statistics |title=Computer Methods for Macromolecular Sequence Analysis | year=1996
journal= Phys. Rev. E| volume=65| page=056102|doi=10.1103/PhysRevE.65.056102| pmid=12059642| issue=5|arxiv=cond-mat/0108201|bibcode=2002PhRvE..65e6102H| s2cid=193085}}</ref><ref name=newberg>{{cite journal| author=Newberg LA | year=2008 | title=Significance of gapped sequence alignments | journal= J Comput Biol| volume=15| pages=1187–1194 | pmid = 18973434 | doi=10.1089/cmb.2008.0125| issue=9| pmc=2737730}}</ref><ref name=eddy>{{cite journal| author=Eddy SR| year=2008 | title=A probabilistic model of local sequence alignment that simplifies statistical significance estimation | journal= PLOS Comput Biol | volume=4| editor1-first=Burkhard| pages=e1000069 | pmid = 18516236| editor1-last=Rost | doi=10.1371/journal.pcbi.1000069| issue=5| pmc=2396288| last2=Rost| first2=Burkhard| bibcode=2008PLSCB...4E0069E| s2cid=15640896 | doi-access=free }}</ref><ref name=bastien>{{cite journal|author1=Bastien O |author2=Aude JC |author3=Roy S |author4=Marechal E | year=2004 | title=Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics | journal= Bioinformatics | volume=20| issue=4| pages=534–537| pmid = 14990449| doi = 10.1093/bioinformatics/btg440 | doi-access=free | citeseerx=10.1.1.602.6979 }}</ref><ref name=agrawal11>{{cite journal|author1=Agrawal A |author2=Huang X | year=2011| title=Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices|journal= IEEE/ACM Transactions on Computational Biology and Bioinformatics| volume=8| pages=194–205|doi=10.1109/TCBB.2009.69|pmid=21071807 | issue=1|s2cid=6559731 }}</ref><ref name=agrawal08>{{cite journal| author1=Agrawal A| author2=Brendel VP| author3=Huang X| year=2008| title=Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment| journal=International Journal of Computational Biology and Drug Design| volume=1| 
pages=347–367| doi=10.1504/IJCBDD.2008.022207| pmid=20063463| url=http://inderscience.metapress.com/content/1558538106522500/| issue=4| url-status=dead| archive-url=https://archive.today/20130128163812/http://inderscience.metapress.com/content/1558538106522500/| archive-date=28 January 2013| df=dmy-all| url-access=subscription}}</ref>
===Assessment of credibility===
Line 198 ⟶ 197:
==Other biological uses==
Sequenced RNA, such as [[expressed sequence tags]] and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about [[alternative splicing]]<ref>{{cite book |author1=Kim N |author2=Lee C |title=Bioinformatics |chapter=Bioinformatics Detection of Alternative Splicing |volume=452 |pages=179–97 |year=2008 |pmid=18566765 |doi=10.1007/978-1-60327-159-2_9 |series=Methods in Molecular Biology |isbn=978-1-58829-707-5}}</ref> and [[RNA editing]].<ref>{{cite journal |vauthors=Li JB, Levanon EY, Yoon JK, etal |title=Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing |journal=Science |volume=324 |issue=5931 |pages=1210–3 |date=May 2009 |pmid=19478186 |doi=10.1126/science.1170995|bibcode=2009Sci...324.1210L |s2cid=31148824 }}</ref> Sequence alignment is also a part of [[genome assembly]], where sequences are aligned to find overlap so that ''[[contig]]s'' (long stretches of sequence) can be formed.<ref>{{cite journal |vauthors=Blazewicz J, Bryja M, Figlerowicz M, etal |title=Whole genome assembly from 454 sequencing output via modified DNA graph concept |journal=Comput Biol Chem |volume=33 |issue=3 |pages=224–30 |date=June 2009 |pmid=19477687 |doi=10.1016/j.compbiolchem.2009.04.005}}</ref> Another use is [[single nucleotide polymorphism|SNP]] analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.<ref>{{cite journal |author1=Duran C |author2=Appleby N |author3=Vardy M |author4=Imelfort M |author5=Edwards D |author6=Batley J |title=Single nucleotide polymorphism discovery in barley using autoSNPdb |journal=Plant Biotechnol. J. |volume=7 |issue=4 |pages=326–33 |date=May 2009 |pmid=19386041 |doi=10.1111/j.1467-7652.2009.00407.x |doi-access=free |bibcode=2009PBioJ...7..326D }}</ref>
==Non-biological uses==
The methods used for biological sequence alignment have also found applications in other fields, most notably in [[natural language processing]] and in [[Sequence analysis in social sciences|social sciences]], where the [[Needleman-Wunsch algorithm]] is usually referred to as [[Optimal matching]].<ref>{{cite journal|author1=Abbott A. |author2=Tsay A. | year=2000 | title=Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect | journal=Sociological Methods and Research | volume=29|issue=1 | pages=3–33 | doi=10.1177/0049124100029001001|s2cid=121097811 }}</ref> Techniques that generate the set of elements from which words will be selected in [[natural language generation|natural-language generation]] algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of [[automated theorem proving|computer-generated mathematical proofs]].<ref name=Barzilay>{{cite book|author1=Barzilay R |author2=Lee L. |title=Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP '02 |chapter=Bootstrapping lexical choice via multiple-sequence alignment |year=2002 | pages=164–171 | chapter-url=
==Software==
{{Main|Sequence alignment software}}
A more complete list of available software categorized by algorithm and alignment type is available at [[sequence alignment software]], but common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI|last=EMBL-EBI|website=www.EBI.ac.uk|access-date=12 June 2017}}</ref> and T-coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web|url=
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 | doi-access=free }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. | title=A comprehensive comparison of multiple sequence alignment programs | url= | pmid=10373585 | doi = 10.1093/nar/27.13.2682 | issue=13 | pmc=148477 }}</ref> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<ref>{{cite web|url=http://3d-alignment.eu/|title=Multiple sequence alignment: Strap|website=3d-alignment.eu|access-date=12 June 2017}}</ref>
Line 237 ⟶ 236:
[[Category:Evolutionary developmental biology]]
[[Category:Algorithms on strings]]