Multiple sequence alignment: Difference between revisions

Bluebot (talk | contribs)
Fixing header errors per the Manual of Style
add motif finding
Line 43:
 
The technique of [[simulated annealing]] refines an existing MSA produced by another method through a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function such as the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates and relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods (to explore more thoroughly the local optima near the newly "colonized" regions). This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing){{ref|Kim}}.
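The refinement loop described above can be sketched roughly as follows. This is a generic Metropolis-style annealer, not the MSASA procedure: the scoring values, the gap-shifting move, the geometric cooling schedule, and the names <code>sum_of_pairs</code>, <code>propose</code>, and <code>anneal</code> are all assumptions made for illustration. It also cools monotonically instead of alternating high- and low-temperature periods.

```python
import math
import random


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score every pair of rows in every column (illustrative scores)."""
    score = 0
    for column in zip(*alignment):
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                a, b = column[i], column[j]
                if a == "-" and b == "-":
                    continue  # gap-gap pairs contribute nothing
                if a == "-" or b == "-":
                    score += gap
                elif a == b:
                    score += match
                else:
                    score += mismatch
    return score


def propose(alignment, rng):
    """Rearrangement: swap one gap with a neighbouring character in one row.

    Row lengths and the order of residues within each row are preserved.
    """
    rows = [list(r) for r in alignment]
    row = rng.choice(rows)
    gaps = [k for k, ch in enumerate(row) if ch == "-"]
    if gaps:
        k = rng.choice(gaps)
        n = k + rng.choice((-1, 1))
        if 0 <= n < len(row):
            row[k], row[n] = row[n], row[k]
    return ["".join(r) for r in rows]


def anneal(alignment, t_start=2.0, t_end=0.05, cooling=0.95,
           steps_per_t=50, seed=0):
    """Refine an existing alignment, keeping the best one seen."""
    rng = random.Random(seed)
    current = list(alignment)
    current_score = sum_of_pairs(current)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            candidate = propose(current, rng)
            delta = sum_of_pairs(candidate) - current_score
            # Always accept improvements; accept a worse candidate with
            # probability exp(delta / t), which shrinks as t falls.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, current_score + delta
                if current_score > best_score:
                    best, best_score = current, current_score
        t *= cooling
    return best, best_score
```

Calling <code>anneal(["ACGT-A", "A-CGTA", "ACG-TA"])</code> returns an alignment of the same sequences whose sum-of-pairs score is at least that of the input, since the best alignment seen so far is always retained.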
 
==Motif finding==
Motif finding, also known as profile analysis, is a method of locating [[sequence motif]]s in global MSAs that serves both to improve the MSA and to produce a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short, highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps{{ref|Mount}}. Alternatively, statistical pattern-finding algorithms can identify motifs as a precursor to an MSA rather than deriving them from one. When the query set contains only a small number of sequences or only highly related sequences, [[pseudocount]]s are often added to normalize the distribution reflected in the scoring matrix. In particular, this corrects zero-probability entries in the matrix to values that are small but nonzero.
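As a minimal sketch of the matrix construction and pseudocount correction, the following builds a per-position probability matrix from an aligned motif and scores a candidate segment against it. The nucleotide-plus-gap alphabet, the add-one pseudocount, the uniform background, and the function names <code>build_profile</code> and <code>log_odds</code> are assumptions chosen for illustration, not the scheme of any particular profile-analysis program.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet="ACGT-", pseudocount=1.0):
    """Return one {character: probability} dict per alignment column.

    The alphabet includes "-" so that gaps get their own matrix entries.
    """
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)  # missing characters count as 0
        profile.append(
            {ch: (counts[ch] + pseudocount) / total for ch in alphabet}
        )
    return profile


def log_odds(segment, profile, alphabet="ACGT-"):
    """Sum of log2(profile probability / uniform background) per position."""
    background = 1.0 / len(alphabet)
    return sum(
        math.log2(profile[i][ch] / background)
        for i, ch in enumerate(segment)
    )


motif = ["ACG", "ACG", "ACT"]
profile = build_profile(motif)
print(profile[2])                # column 3: G and T observed, others pseudocount only
print(log_odds("ACG", profile))  # segment matching the motif
print(log_odds("TTT", profile))  # segment unlike the motif
```

With <code>pseudocount=0.0</code>, characters never observed in a column receive probability zero, and scoring a segment that contains one would require the logarithm of zero; the pseudocount avoids this.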
 
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA, or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families{{ref|Henikoff}}. Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. A server for locating motifs in unaligned sequences is available at [http://blocks.fhcrc.org/ BLOCKS].
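The ungapped restriction alone can be illustrated by scanning an existing MSA for runs of gap-free columns. This sketch is not the procedure used by the BLOCKS server, which also takes conservation into account; the function name <code>ungapped_blocks</code> and the <code>min_width</code> threshold are hypothetical.

```python
def ungapped_blocks(alignment, min_width=3):
    """Return (start, end, rows) for each maximal run of gap-free columns.

    start is inclusive and end is exclusive; runs narrower than
    min_width columns are discarded.
    """
    blocks = []
    start = None
    width = len(alignment[0])
    # Iterate one column past the end so that a run reaching the
    # final column is still closed.
    for col in range(width + 1):
        gap_free = col < width and all(row[col] != "-" for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                rows = [row[start:col] for row in alignment]
                blocks.append((start, col, rows))
            start = None
    return blocks


msa = ["ACGT-ACG",
       "ACGTTACG",
       "ACG-TACG"]
for start, end, rows in ungapped_blocks(msa):
    print(start, end, rows)
```

For this toy alignment the scan finds two candidate blocks, columns 0 to 2 and columns 5 to 7, separated by the two columns that contain gaps.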
 
Statistical pattern-finding has been implemented using both the [[expectation-maximization algorithm]] and the [[Gibbs sampler]]. One of the most common motif-finding tools, known as MEME, uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion program MAST. Both are available at [http://meme.sdsc.edu/meme/intro.html MEME/MAST].
 
==See also==
Line 64 ⟶ 71:
#{{note|Notredame3}} Notredame C, O'Brien EA, Higgins DG. (1997). RAGA: RNA sequence alignment by genetic algorithm. ''Nucleic Acids Res'' 25(22):4570-80.
#{{note|Kim}} Kim J, Pramanik S, Chung MJ. (1994). Multiple sequence alignment using simulated annealing. ''Comput Appl Biosci'' 10(4):419-26.
#{{note|Henikoff}} Henikoff S, Henikoff JG. (1991). Automated assembly of protein blocks for database searching. ''Nucleic Acids Res'' 19:6565-72.
 
=== Survey articles ===