Content deleted Content added
m v2.04b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation) |
Link suggestions feature: 3 links added. |
||
(32 intermediate revisions by 18 users not shown) | |||
Line 1:
{{
[[File:RPLP0 90 ClustalW aln.gif
'''Multiple sequence alignment''' ('''MSA''')
==Problem statement==
Given <math>m</math> sequences <math>S_i</math>, <math>i = 1,\cdots,m</math> similar to the form below:
Line 42 ⟶ 41:
===Tracing alignments===
When determining the best suited alignments for each MSA, a ''trace'' is usually generated. A trace is a set of ''realized'', or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences.
==Alignment methods==
There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences.
===
A direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.<ref>{{cite web|title=Help with matrices used in sequence comparison tools|url=http://www.ebi.ac.uk/help/matrix.html|url-status=dead|archive-url=https://web.archive.org/web/20100311140200/http://www.ebi.ac.uk/help/matrix.html|archive-date=March 11, 2010|
For ''n'' individual sequences, the naive method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise [[sequence alignment]]. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. Expressed with the [[big O notation]] commonly used to measure [[Computational complexity theory|computational complexity]], a [[Naïve algorithm|naïve]] MSA takes ''O(Length<sup>Nseqs</sup>)'' time to produce.
===
The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method) developed by Da-Fei Feng and Doolittle in 1987.<ref name="feng1987progressive">{{cite journal |vauthors=Feng DF, Doolittle RF |year=1987 |title=Progressive sequence alignment as a
Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program.<ref name="mount"/>
Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences.
Another common progressive alignment method
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A semi-progressive method that improves alignment quality and does not use a lossy heuristic while
===Iterative methods===
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique.<ref name="hirosawa">{{cite journal |vauthors=Hirosawa M, Totoki Y, Hoshida M, Ishikawa M |
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty.<ref name="brudno">{{cite journal |
A third popular iteration-based method
===Consensus methods===
Consensus methods attempt to find the optimal multiple sequence alignment given multiple different alignments of the same set of sequences. There are two commonly used consensus methods,
===Hidden Markov models===
[[File:A profile HMM modelling a multiple sequence alignment.png|thumb|A profile [[hidden Markov model]] (HMM) modelling a multiple sequence alignment]]
A [[
Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method,
▲[[Hidden Markov model]]s are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.<ref name="mount" />
Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is
▲Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, known as the [[Viterbi algorithm]], is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA.<ref name="hughey">{{cite journal |vauthors=Hughey R, Krogh A | year = 1996 | title = Hidden Markov models for sequence analysis: extension and analysis of the basic method | journal = CABIOS | volume = 12 | issue = 2| pages = 95–107 | pmid = 8744772 | doi=10.1093/bioinformatics/12.2.95| citeseerx = 10.1.1.44.3365 }}</ref> This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.<ref name="mount" />
SAM has been used as a source of alignments for [[protein structure prediction]] to participate in the Critical Assessment of Structure Prediction ([[CASP]]) structure prediction experiment and to develop a database of predicted proteins in the [[yeast]] species ''[[S. cerevisiae]]''. [[
▲Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is [http://sourceforge.net/projects/poamsa/files/ POA] (Partial-Order Alignment)<!--this download link is temporary, remember to replace when it's fixed-->;<ref name="grasso">{{cite journal | doi = 10.1093/bioinformatics/bth126 |vauthors=Grasso C, Lee C | year = 2004 | title = Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems | journal = Bioinformatics | volume = 20 | issue = 10| pages = 1546–56 | pmid = 14962922 | doi-access = free }}</ref> a similar but more generalized method is implemented in the packages [http://compbio.soe.ucsc.edu/sam.html SAM] (Sequence Alignment and Modeling System).<ref name="hugheyT">Hughey R, Krogh A. SAM: Sequence alignment and modeling software system. Technical Report UCSC-CRL-96-22, University of California, Santa Cruz, CA, September 1996.</ref> and [[HMMER]].<ref name="durbin">Durbin R, Eddy S, Krogh A, Mitchison G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.</ref>
▲SAM has been used as a source of alignments for [[protein structure prediction]] to participate in the [[CASP]] structure prediction experiment and to develop a database of predicted proteins in the [[yeast]] species ''[[S. cerevisiae]]''. [[HHpred / HHsearch|HHsearch]]<ref>{{ cite journal| author = Söding J | title = Protein homology detection by HMM-HMM comparison| journal = Bioinformatics| year = 2005| volume = 21| issue = 7| pages = 951–960| pmid = 15531603| doi = 10.1093/bioinformatics/bti125| citeseerx = 10.1.1.519.1257}}</ref> is a software package for the detection of remotely related protein sequences based on the pairwise comparison of HMMs. A server running HHsearch ([[HHpred / HHsearch|HHpred]]) was by far the fastest of the 10 best automatic structure prediction servers in the CASP7 and CASP8 structure prediction competitions.<ref>{{ cite journal|vauthors=Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T | title = Automated server predictions in CASP7| journal = Proteins | year = 2007 | volume = 69 | issue = Suppl 8 | pages = 68–82 | pmid = 17894354| doi = 10.1002/prot.21761 }}</ref>
===Phylogeny-aware methods===
[[File:Non-homologous exon alignment.jpg|thumb|450px|Non-homologous exon alignment by an iterative method (a), and by a phylogeny-aware method (b)]]
Most multiple sequence alignment methods try to minimize the number of [[Indel|insertions/deletions]] (gaps) and, as a consequence, produce compact alignments. This causes several problems if the sequences to be aligned contain non-[[Homology (biology)#Sequence homology|homologous]] regions, if gaps are informative in a [[phylogeny]] analysis. These problems are common in newly produced sequences that are poorly annotated and may contain [[Frameshift|frame-shifts]], wrong [[Protein ___domain|domains]] or non-homologous [[RNA splicing|spliced]] [[
In 2012, two new phylogeny-aware tools appeared. One is called ''PAGAN'' that was developed by the same team as PRANK.<ref name="Loytynoja-2012">{{cite journal |
===Motif finding===
[[File:Caspase-motif-alignment.png|right|thumb|500px|Alignment of the seven [[Drosophila]] [[caspase]]s colored by motifs as identified by MEME. When motif positions and sequence alignments are generated independently, they often correlate well but not perfectly, as in this example.]]
Motif finding, also known as profile analysis, is a method of locating [[sequence motif]]s in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps.<ref name="mount"
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families.<ref name="henikoff1991">{{cite journal |
Statistical pattern-matching has been implemented using both the [[expectation-maximization algorithm]] and the [[Gibbs sampler]]. One of the most common motif-finding tools,
===Non-coding multiple sequence alignment===
[[Non-coding DNA]] regions, especially [[transcription factor]] binding sites (TFBSs), are
==Optimization==
===Genetic algorithms and simulated annealing===
Standard optimization techniques in computer science
The technique of [[simulated annealing]], by which an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).<ref name="kim">{{cite journal |vauthors=Kim J, Pramanik S, Chung MJ | year = 1994 | title = Multiple sequence alignment using simulated annealing | journal = Comput Appl Biosci | volume = 10 | issue = 4| pages = 419–26 | pmid = 7804875 | doi=10.1093/bioinformatics/10.4.419}}</ref>▼
▲The technique of [[simulated annealing]], by which an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).<ref name="kim">{{cite journal |vauthors=Kim J, Pramanik S, Chung MJ |
=== Mathematical programming and exact solution algorithms ===▼
[[Mathematical optimization|Mathematical programming]] and in particular [[
===Simulated quantum computing===
In January 2017, [[D-Wave Systems]] announced that its qbsolv open-source [[quantum computing]] software had been successfully used to find a faster solution to the MSA problem.<ref>
==Alignment visualization and quality control==
The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. For example, an evaluation of several leading alignment programs using the [[List of sequence alignment software#Benchmarking|BAliBase benchmark]] found that at least 24% of all pairs of aligned amino acids were incorrectly aligned.<ref name="nuin2006">{{cite journal |vauthors=Nuin PA, Wang Z, Tillier ER |year=2006 |title=The accuracy of several multiple sequence alignment programs for proteins |journal=BMC Bioinformatics |doi=10.1186/1471-2105-7-471 |pmid=17062146 |volume=7 |pmc=1633746 |pages=471 |doi-access=free}}</ref> These errors can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. As the number of sequence and their divergence increases many more errors will be made simply because of the heuristic nature of MSA algorithms. [[List of alignment visualization software|Multiple sequence alignment viewers]] enable alignments to be visually reviewed, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.<ref>{{cite web |
However, as the number of sequences increases and especially in genome-wide studies that involve many MSAs it is impossible to manually curate all alignments. Furthermore, manual curation is subjective. And finally, even the best expert cannot confidently align the more ambiguous cases of highly diverged sequences. In such cases it is common practice to use automatic procedures to exclude unreliably aligned regions from the MSA. For the purpose of phylogeny reconstruction (see below) the Gblocks program is widely used to remove alignment blocks suspect of low quality, according to various cutoffs on the number of gapped sequences in alignment columns.<ref name="castresana2000">{{cite journal |
There are free programs available for visualization of multiple sequence alignments, for example [[Jalview]] and [[UGENE]].
==Phylogenetic use==
Multiple sequence alignments can be used to create a [[phylogenetic tree]].<ref name="Budd_2009">{{Cite web |
Multiple sequence alignments can also be used to identify functionally important sites, such as binding sites, active sites, or sites corresponding to other key functions, by locating conserved domains. When looking at multiple sequence alignments, it is useful to consider different aspects of the sequences when comparing sequences. These aspects include identity, similarity, and homology. Identity means that the sequences have identical residues at their respective positions. On the other hand, similarity has to do with the sequences being compared having similar residues quantitatively. For example, in terms of nucleotide sequences, pyrimidines are considered similar to each other, as are purines. Similarity ultimately leads to homology, in that the more similar sequences are, the closer they are to being homologous. This similarity in sequences can then go on to help find common ancestry.<ref name="Budd_2009"/>
==See also==
*[[Alignment-free sequence analysis]]▼
*[[Cladistics]]
*[[Generalized tree alignment]]
*[[List of alignment visualization software|Multiple sequence alignment viewers]]▼
*[[PANDIT (database)|PANDIT]], a biological database covering protein domains
*[[Phylogenetics]]
*[[Sequence alignment software]]
▲*[[List of alignment visualization software|Multiple sequence alignment viewers]]
*[[Structural alignment]]
▲*[[Alignment-free sequence analysis]]
==References==
Line 143 ⟶ 138:
===Survey articles===
*{{Cite book |
*{{Cite journal |
*{{Cite journal |
*{{Cite journal |
*{{Cite journal |
==External links==
* [http://www.expasy.org/tools/#align ExPASy sequence alignment tools]
* [https://web.archive.org/web/20100107231637/http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html Archived Multiple Alignment Resource Page]
* [https://web.archive.org/web/20060909003117/http://pbil.univ-lyon1.fr/alignment.html Tools for Multiple Alignments]
* [http://www.clustal.org/ An entry point to clustal servers and information]
* [http://www.tcoffee.org/ An entry point to the main T-Coffee servers]
* [http://www.mergealign.com/ An entry point to the main MergeAlign server and information]
*European Bioinformatics Institute servers:
** [http://www.ebi.ac.uk/Tools/clustalw2/ ClustalW2]
** [http://www.ebi.ac.uk/Tools/muscle/ Muscle]
** [http://www.ebi.ac.uk/Tools/t-coffee/ T-coffee]
** [http://www.ebi.ac.uk/Tools/mafft/ MAFFT]
** [http://www.ebi.ac.uk/Tools/kalign/ KALIGN]
===Lecture notes, tutorials, and courses===
* [http://cmb.molgen.mpg.de/ Multiple sequence alignment lectures]
* [https://web.archive.org/web/20120305164700/http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html Lecture Notes and practical exercises] on multiple sequence alignments at the [[European Molecular Biology Laboratory
* [https://web.archive.org/web/20110904184621/http://www.avatar.se/molbioinfo2001/
* [https://web.archive.org/web/20090821192233/http://bioinf.may.ie/school02/notes.html Molecular Evolution and Bioinformatics Lecture Notes]
{{DEFAULTSORT:Multiple Sequence Alignment}}
Line 175 ⟶ 168:
[[Category:Computational phylogenetics]]
[[Category:Markov models]]
[[Category:NP-complete problems]]
|