Content deleted Content added
add motif finding |
refconvert, add more recent review article |
||
Line 1:
[[Image:RPLP0_90_ClustalW_aln.gif|right|thumb|300px|First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with [[ClustalW]].]]
Line 10 ⟶ 8:
The most direct method for producing an MSA uses the [[dynamic programming]] technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a [[gap penalty]] and a [[substitution matrix]] assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a substitution matrix can be used, but since there are only four possible standard characters per sequence and the individual nucleotides do not typically differ much in substitution probability, the parameters for DNA and RNA sequences usually consist of a gap penalty, a positive score for character matches, and a negative score for mismatches.
For ''n'' individual sequences, the method requires constructing the ''n''-dimensional equivalent of the matrix formed in standard pairwise dynamic programming. The search space thus increases exponentially with increasing ''n'' and is also strongly dependent on sequence length. To find the global optimum for ''n'' sequences this way has been shown to be an [[NP-complete]] problem
==Progressive alignment construction==
Line 17 ⟶ 15:
One major limitation of progressive methods is their heavy dependence on the initial assignment of relatedness and on the quality of the initial alignment. The methods are thus sensitive as well to the distribution of sequences in the query set; performance improves when relatedness among query sequences is a relatively smooth gradient rather than distantly separated clusters. Performance also degrades significantly when all of the sequences in the set are rather distantly related, because inaccuracies in the initial alignment are then more likely. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. Judicious choice of weighting can aid in evaluating relatedness and mitigate the effects of relatively poor initial alignments early in the progression.
Progressive alignment methods are efficient enough to implement on a large scale for many sequences and are often run on publicly accessible web servers so users need not locally install the applications of interest. A very popular progressive alignment method is the [[Clustal]] family
Another common progressive alignment method called [[T-Coffee]]
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A very recent semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in [[polynomial time]]
==Iterative methods==
A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result - that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general [[objective function]] such as finding a high-quality alignment score.
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or [[sequence motif]]s without introducing a gap penalty
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences
==Hidden Markov models==
[[Hidden Markov model]]s are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. Because HMMs are probablistics, they do not produce the same solution every time they are run on the same dataset; thus they cannot be guaranteed to converge to an optimal alignment. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions
Typical HMM-based methods work by representing an MSA as a form of [[directed acyclic graph]] known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. A efficient search variant of the dynamic programming method, known as the [[Viterbi algorithm]], is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA
Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is [http://sourceforge.net/project/showfiles.php?group_id=168080 POA] (Partial-Order Alignment, temporarily available for download via [[SourceForge]]); a similar but more generalized method is implemented in the package [http://www.cse.ucsc.edu/compbio/sam.html SAM] (Sequence Alignment and Modeling System), which is also available as a [http://www.cse.ucsc.edu/compbio/HMM-apps/T02-query.html web portal]. SAM has been used as a source of alignments for [[protein structure prediction]] to participate in the [[CASP]] structure prediction experiment and to develop a database of predicted proteins in the yeast species [[S. cerevisiae]]. A database-search technique based on HMM methods is available in the program [http://hmmer.wustl.edu/ HMMer].
==Genetic algorithms and simulated annealing==
Standard optimization techniques in computer science - both of which were inspired by, but do not directly reproduce, physical processes - have also been used in an attempt to more efficiently produce quality MSAs. On such technique, [[genetic algorithm]]s, have been used for MSA production in an attempt to broadly simulate the hypothesized evolutionary process that gave rise to the divergence in the query set. The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at varying positions. A general [[objective function]] is optimized during the simulation, most generally the "sum of pairs" maximization function introduced in dynamic programming-based MSA methods. A technique for protein sequences has been implemented in the software program SAGA (Sequence Alignment by Genetic Algorithm)
The technique of [[simulated annealing]], by which an existing MSA produced by another method is refined by a series of rearrangements designed to find more optimal regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing)
==Motif finding==
Motif finding, also known as profile analysis, is a method of locating [[sequence motif]]s in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families
Statistical pattern-matching has been implemented using both the [[expectation-maximization algorithm]] and the [[Gibbs sampler]]. One of the most common motif-finding tools, known as MEME, uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion program MAST. Both are available at [http://meme.sdsc.edu/meme/intro.html MEME/MAST].
Line 56 ⟶ 54:
== References ==
<div class="references-small"><references /></div>
=== Survey articles ===
*{{cite book | first = L. | last = Duret | coauthors = S. Abdeddaim | year = 2000 | chapter = Multiple alignment for structural functional or phylogenetic analyses of homologous sequences | editor = D. Higgins and W. Taylor | title = Bioinformatics sequence structure and databanks | ___location = Oxford | publisher = Oxford University Press}}
*{{cite journal | first = C. | last = Notredame | year = 2002 | title = Recent progresses in multiple sequence alignment: a survey | journal = Pharmacogenomics | volume = 31 | issue = 1 | pages = 131 -- 144 }}
*{{cite journal | first = J. D. | last = Thompson | coauthors = F. Plewniak and O. Poch | year = 1999 | title = A comprehensive comparison of multiple sequence alignment programs | journal = Nucleic Acids Research | volume = 27 | issue = 13 | pages = 12682--2690 }}
*{{cite journal | first = I.M. | last = Wallace | coauthors = Blackshields G and Higgins DG. | year = 2005 | title = Multiple sequence alignments | journal = Curr Opin Struct Biol | volume = 15 | issue = 3 | pages = 261-6.}}
[[Category:Bioinformatics]]
|