Revision as of 05:05, 3 September 2018 by GoingBatty (m, →Sequence clustering algorithms and packages: General fixes; tag: AWB), compared with revision as of 00:24, 30 September 2018 by Evolution and evolvability (+ link to seq homology; tag: Visual edit)
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related in some way. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]), or [[protein]] origin. For proteins, [[Homologous sequence|homologous]] sequences are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].

Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence does not match any existing representative, it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]].

Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
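The two strategies described above can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used by UCLUST or CD-HIT: the function names <code>similarity</code>, <code>greedy_cluster</code> and <code>single_linkage_cluster</code> are hypothetical, and the similarity score is a simple stand-in from Python's standard library rather than an alignment-based identity. Any pre-filtering or input ordering that real tools perform is omitted.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumed stand-in for an alignment-based score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy representative-based clustering (illustrative sketch).

    Each sequence joins the first cluster whose representative it matches at
    or above `threshold`; otherwise it becomes the representative of a new
    cluster. Returns a list of (representative, members) pairs.
    """
    clusters = []
    for s in seqs:
        for rep, members in clusters:
            if similarity(s, rep) >= threshold:
                members.append(s)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((s, [s]))
    return clusters


def single_linkage_cluster(seqs, threshold):
    """Single-linkage clustering (illustrative sketch).

    Builds the transitive closure of all pairs scoring at or above
    `threshold`, using a union-find structure over sequence indices.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

On a chain of sequences in which only neighbouring members are similar, the single-linkage version merges the whole chain into one cluster, whereas the greedy version compares each sequence only with the cluster representatives and may split the chain. The greedy version also avoids the all-against-all comparison that the single-linkage sketch performs.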

Sequence clustering: Difference between revisions