GoingBatty (talk | contribs) |
+ link to seq homology |
Line 1:
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related in some way. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]), or [[protein]] origin.
For proteins, [[
Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative. A sequence that matches no existing representative becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
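
The two strategies described above can be contrasted with a minimal Python sketch. It is illustrative only and is not the implementation used by UCLUST or CD-HIT. The function names (<code>identity</code>, <code>greedy_cluster</code>, <code>single_linkage</code>) are hypothetical, and the similarity measure is an assumed stand-in that counts matching positions without performing a sequence alignment.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def identity(a: str, b: str) -> float:
    """Toy similarity (an assumption): fraction of matching positions,
    compared position by position with no alignment step."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float,
                   sim: Similarity = identity) -> Dict[str, List[str]]:
    """Greedy representative-based clustering, in input order.

    Each sequence joins the first cluster whose representative it matches;
    otherwise it becomes the representative of a new cluster.
    """
    clusters: Dict[str, List[str]] = {}
    for s in seqs:
        for rep in clusters:
            if sim(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[s] = [s]
    return clusters


def single_linkage(seqs: List[str], threshold: float,
                   sim: Similarity = identity) -> List[List[str]]:
    """Single-linkage clustering: the transitive closure of all pairs
    whose similarity meets the threshold, built with union-find."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


# Hypothetical toy input: each neighbouring pair among the first three
# sequences is 75% identical, but the first and third are only 50% identical.
toy = ["AAAAAAAA", "AAAAAATT", "AAAATTTT", "GGGGGGGG"]
greedy_result = greedy_cluster(toy, 0.75)
linkage_result = single_linkage(toy, 0.75)
```

On this toy input, single-linkage merges the first three sequences through the chain of pairwise matches, while the greedy approach keeps the third sequence separate because it is compared only against the representative. The sketch compares every pair or every representative directly; production tools rely on faster filtering and alignment-based scores.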