Revision as of 07:03, 24 September 2015 by Lotje (edit summary: <ref></ref>)
Revision as of 07:05, 24 September 2015 by Lotje (m; edit summary: Filled in 12 bare reference(s) with reFill)
Line 2:
For proteins, [[Homology (biology)|homologous]] sequences are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]]. Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch |title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org |title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; a sequence that matches no existing representative becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]]. Sequence clusters often correspond closely to, but are not identical with, [[protein family|protein families]]. Determining a representative [[tertiary structure]] for each sequence cluster is the aim of many [[structural genomics]] initiatives.
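The two strategies described above can be illustrated with a minimal Python sketch. It is not the implementation used by UCLUST, CD-HIT or any other package: the names <code>toy_identity</code>, <code>greedy_cluster</code> and <code>single_linkage_cluster</code> are hypothetical, and the similarity score is a deliberately simple position-by-position identity rather than a score derived from a sequence alignment.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for an alignment-based score: the number of
    matching positions divided by the length of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float,
                   sim: Similarity = toy_identity) -> List[List[str]]:
    """Greedy clustering: a sequence joins the first cluster whose
    representative it matches; otherwise it founds a new cluster."""
    clusters: List[List[str]] = []  # element 0 of each list is the representative
    for s in seqs:
        for cluster in clusters:
            if sim(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:  # no representative was similar enough
            clusters.append([s])
    return clusters


def single_linkage_cluster(seqs: List[str], threshold: float,
                           sim: Similarity = toy_identity) -> List[List[str]]:
    """Single-linkage clustering: connected components (the transitive
    closure) of the graph linking every pair scoring >= threshold."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


example = ["AAAA", "AAAT", "AATT", "GGGG"]
greedy_result = greedy_cluster(example, 0.75)
linkage_result = single_linkage_cluster(example, 0.75)
```

With a threshold of 0.75, the greedy variant keeps <code>AATT</code> apart because it is compared only with the representative <code>AAAA</code>, whereas single linkage merges it through the intermediate <code>AAAT</code>. The greedy result also depends on the order in which sequences are presented.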
== Sequence clustering algorithms and packages ==
* OrthoFinder:<ref>{{cite web|url=http://www.stevekellylab.com/software/orthofinder|title=OrthoFinder|work=Steve Kelly Lab}}</ref> a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)<ref>{{cite journal |title=OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. |author=Emms DM, Kelly S.
Line 36:
|doi=10.1093/nar/30.7.1575}}</ref>
* BAG: a graph theoretic sequence clustering algorithm<ref>http://bio.informatics.indiana.edu/sunkim/BAG/</ref>
* JESAM:<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/jesam/jesam_paper.html|title=Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters|work=littlest.co.uk}}</ref> Open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster:<ref>http://ratest.eng.uiowa.edu/pubsoft/clustering/</ref> Parallel Clustering of EST (Gene) Sequences
* BLASTClust single-linkage clustering with BLAST<ref>{{cite web|url=http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html|title=NCBI News: Spring 2004-BLASTLab|work=nih.gov}}</ref>
* (Multi)netclust:<ref>{{cite web|url=http://www.bioinformatics.nl/netclust/|title=WUR Multi-netclust web server|work=bioinformatics.nl}}</ref> fast and memory-efficient detection of connected clusters in (multi-parametric) data networks<ref>{{cite journal |title=Multi-netclust: an efficient tool for finding connected clusters in multi-parametric networks |author=Kuzniar, A., Dhir, S., Nijveen, H., Pongor, S. and Leunissen, J. A. M.
Line 49:
|pmid=20679333 |doi=10.1093/bioinformatics/btq435}}</ref>
* Clusterer:<ref>{{cite web|url=http://bugaco.com/bioinf/clusterer/|title=Clusterer: extendable java application for sequence grouping and cluster analyses|work=bugaco.com}}</ref> extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb:<ref>http://web.archive.org/web/20080101032917/http://blast.wustl.edu/pub/nrdb/</ref> a program for merging trivially redundant (identical) sequences
* CluSTr:<ref>http://www.ebi.ac.uk/clustr/</ref> A single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
* ICAtools<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/icatools/|title=Introduction to the ICAtools|work=littlest.co.uk}}</ref> - original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Virus Orthologous Clusters:<ref>{{cite web|url=http://athena.bioc.uvic.ca/tools/VOCS|title=VOCS - Viral Bioinformatics Resource Center|work=uvic.ca}}</ref> A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
* Skipredundant EMBOSS tool<ref>{{cite web|url=http://bioweb2.pasteur.fr/docs/EMBOSS/skipredundant.html|title=EMBOSS: skipredundant|work=pasteur.fr}}</ref> to remove redundant sequences from a set
<!-- Lets try the above (although both are wobbly) -->
<!-- * [http://bio.cc/RSDB RSDB] broken link -->

== Non-redundant sequence databases ==
* PISCES: A Protein Sequence Culling Server<ref>{{cite web|url=http://dunbrack.fccc.edu/pisces/|title=Dunbrack Lab|work=fccc.edu}}</ref>
* RDB90<ref name=rdb90/>
* UniRef: A non-redundant [[UniProt]] sequence database<ref>{{cite web|url=http://www.uniprot.org/database/DBDescription.shtml#uniref|title=About UniProt|work=uniprot.org}}</ref>

==See also==

Sequence clustering: Difference between revisions