Sequence clustering: Difference between revisions

+ link to seq homology
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related, typically as judged by sequence similarity. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]) or [[protein]] origin.
For proteins, [[Homologous sequence|homologous sequences]] are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].
 
Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched then it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
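
The two strategies described above can be contrasted with a minimal Python sketch. It is illustrative only and is not the implementation used by UCLUST or CD-HIT. The function names (<code>identity</code>, <code>greedy_cluster</code>, <code>single_linkage</code>), the toy sequences and the 0.8 threshold are all assumptions made for the example, and the position-by-position identity score is a stand-in for a real alignment-based similarity. Sequences are processed in input order, which is also an assumption of the sketch.

```python
def identity(a, b):
    """Toy similarity (an assumption): fraction of positions that match,
    relative to the longer sequence. Real tools score an alignment."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold, similarity=identity):
    """Greedy representative-based clustering.

    Each sequence joins the first cluster whose representative it matches
    at or above the threshold; otherwise it becomes a new representative.
    Returns a dict mapping representative -> list of members.
    """
    clusters = {}
    for s in seqs:
        for rep in clusters:
            if similarity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            # No representative matched: s founds a new cluster.
            clusters[s] = [s]
    return clusters


def single_linkage(seqs, threshold, similarity=identity):
    """Single-linkage clustering: the transitive closure of the relation
    'similarity >= threshold', computed with a union-find structure."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


# Hypothetical toy data: B is 80% identical to A, C is 80% identical to B,
# but C is only 60% identical to A.
A, B, C, D = "AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"
toy_seqs = [A, B, C, D]

greedy_result = greedy_cluster(toy_seqs, 0.8)
linkage_result = single_linkage(toy_seqs, 0.8)
```

On this toy input, single-linkage chains A, B and C into one cluster because each neighbouring pair passes the threshold, whereas the greedy method starts a new cluster for C because it is compared only against the representative A. The keys of <code>greedy_result</code> form the non-redundant set of representative sequences.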