Multiple sequence alignment: Difference between revisions

Hidden Markov models: rm external links
Motif finding: rm external links
Line 97:
Motif finding, also known as profile analysis, is a method of locating [[sequence motif]]s in global MSAs. It serves both to produce a better MSA and to produce a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short, highly conserved patterns within the larger alignment and constructing a matrix, similar to a substitution matrix, that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps.<ref name="mount" /> Alternatively, statistical pattern-finding algorithms can identify motifs as a precursor to an MSA rather than as a derivation. When the query set contains only a small number of sequences or only highly related sequences, [[pseudocount]]s are often added to normalize the distribution reflected in the scoring matrix. In particular, this replaces zero-probability entries in the matrix with values that are small but nonzero.
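
The construction of a profile with pseudocounts can be illustrated with a minimal sketch. The function name <code>profile_matrix</code>, the nucleotide-plus-gap alphabet, and the uniform pseudocount of 1 are assumptions chosen for illustration; they are not taken from any particular profile-analysis tool, and real implementations typically use background-weighted pseudocounts and log-odds scores.

```python
from collections import Counter


def profile_matrix(aligned, alphabet="ACGT-", pseudocount=1.0):
    """Return one probability table per alignment column.

    `aligned` is a list of equal-length strings from an existing MSA.
    The gap character is part of the alphabet, so gaps get their own entry.
    A uniform pseudocount is added to every character (an illustrative choice).
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    if any(ch not in alphabet for seq in aligned for ch in seq):
        raise ValueError("alignment contains a character outside the alphabet")

    # Denominator: observed characters plus one pseudocount per alphabet symbol.
    total = len(aligned) + pseudocount * len(alphabet)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        # Counter returns 0 for unseen characters, so only the pseudocount remains.
        profile.append({ch: (counts[ch] + pseudocount) / total for ch in alphabet})
    return profile


if __name__ == "__main__":
    for column in profile_matrix(["ACG-", "ACGT", "ATGT"]):
        print({ch: round(p, 3) for ch, p in column.items()})
```

With only three sequences, a character that never appears in a column would otherwise receive probability zero; the pseudocount keeps every entry small but nonzero, at the cost of flattening the distribution slightly.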
 
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA, or they can be extracted from unaligned sequences using a precalculated set of common motifs generated from known gene families.<ref name="henikoff1991">{{cite journal | vauthors = Henikoff S, Henikoff JG | title = Automated assembly of protein blocks for database searching | journal = Nucleic Acids Res. | volume = 19 | issue = 23 | pages = 6565–72 | date = December 1991 | pmid = 1754394 | pmc = 329220 | doi = 10.1093/nar/19.23.6565 }}</ref> Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. The [https://web.archive.org/web/20130328131920/http://blocks.fhcrc.org/ BLOCKS] server provides an interactive method to locate such motifs in unaligned sequences.
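
The restriction to ungapped regions can be sketched as follows. This is only an illustration of extracting gap-free column runs from an existing MSA; it does not reproduce the block assembly or scoring procedure of Henikoff and Henikoff. The function name <code>ungapped_blocks</code> and the <code>min_length</code> threshold are hypothetical.

```python
def ungapped_blocks(aligned, min_length=3, gap="-"):
    """Return (start, end, segments) for each run of gap-free columns.

    `aligned` is a list of equal-length strings; `end` is exclusive.
    Runs shorter than `min_length` columns are discarded.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    start = None
    # Iterate one position past the end so a trailing run is closed as well.
    for i in range(length + 1):
        gap_free = i < length and all(seq[i] != gap for seq in aligned)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_length:
                blocks.append((start, i, [seq[start:i] for seq in aligned]))
            start = None
    return blocks


if __name__ == "__main__":
    msa = ["MKV-LAGT", "MKVALA-T", "MRV-LSGT"]
    for start, end, segments in ungapped_blocks(msa, min_length=2):
        print(start, end, segments)
```

A fuller treatment would additionally require the columns in a block to be conserved, since a gap-free region is not necessarily a motif.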
 
Statistical pattern matching has been implemented using both the [[expectation-maximization algorithm]] and the [[Gibbs sampler]]. One of the most common motif-finding tools, known as [[Multiple EM for Motif Elicitation|MEME]], uses expectation maximization and hidden Markov methods to generate motifs, which are then used as search tools by its companion MAST in the combined suite [http://meme.sdsc.edu/meme/intro.html MEME/MAST] {{Webarchive|url=https://web.archive.org/web/20100822143504/http://meme.sdsc.edu/meme/intro.html |date=2010-08-22 }}.<ref name="baileyelkan1994">{{cite book |vauthors=Bailey TL, Elkan C |year=1994 |chapter=Fitting a mixture model by expectation maximization to discover motifs in biopolymers |title=Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology |pages=28–36 |publisher=AAAI Press |___location=Menlo Park, California|chapter-url=http://www.cs.toronto.edu/~brudno/csc2417_15/10.1.1.121.7056.pdf}}</ref><ref name="baileygribskov1998">{{cite journal | vauthors = Bailey TL, Gribskov M | title = Combining evidence using p-values: application to sequence homology searches | journal = Bioinformatics | volume = 14 | issue = 1 | pages = 48–54 | date = 1998 | pmid = 9520501 | doi = 10.1093/bioinformatics/14.1.48 | doi-access = free }}</ref>
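
The expectation-maximization approach can be illustrated with a deliberately simplified sketch. It assumes exactly one motif occurrence per unaligned DNA sequence, a fixed background distribution, and a single starting point taken from the first window of the first sequence. These are simplifications made for illustration and do not describe MEME's actual model or implementation; the names <code>em_motif</code>, <code>width</code>, and <code>seed</code> are hypothetical.

```python
ALPHABET = "ACGT"


def em_motif(seqs, width, iterations=50, pseudocount=0.1, seed=None):
    """Estimate a motif of fixed width by EM, one occurrence per sequence.

    Returns (motif, starts): per-position probability tables and the most
    probable start position in each sequence. Assumes every sequence uses
    only ALPHABET, is at least `width` long, and that pseudocount > 0.
    """
    total = sum(len(s) for s in seqs)
    # Background letter frequencies, kept fixed throughout (a simplification).
    background = {a: sum(s.count(a) for s in seqs) / total for a in ALPHABET}

    # Soft initial motif built from the seed word (default: first window).
    seed = seed if seed is not None else seqs[0][:width]
    motif = [{a: (0.7 if a == ch else 0.1) for a in ALPHABET} for ch in seed]

    def e_step(current):
        """Posterior probability of each start position in each sequence."""
        posteriors = []
        for s in seqs:
            weights = []
            for start in range(len(s) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ch = s[start + j]
                    ratio *= current[j][ch] / background[ch]
                weights.append(ratio)
            norm = sum(weights)
            posteriors.append([w / norm for w in weights])
        return posteriors

    for _ in range(iterations):
        posteriors = e_step(motif)
        # M-step: expected letter counts per motif position, plus pseudocounts.
        counts = [{a: pseudocount for a in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for j in range(width):
                    counts[j][s[start + j]] += p
        motif = [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]

    posteriors = e_step(motif)
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return motif, starts


if __name__ == "__main__":
    sequences = ["TTGACGTCATT", "GGTGACGTCAG", "CATGACGTCAC"]
    motif, starts = em_motif(sequences, width=8)
    print(starts)
    print("".join(max(col, key=col.get) for col in motif))
```

Because expectation maximization converges only to a local optimum, the result depends on the starting point; practical tools therefore try many starting points, and a Gibbs sampler addresses the same problem by sampling start positions rather than averaging over them.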
 
===Non-coding multiple sequence alignment===