Sequence clustering: Difference between revisions

Revision as of 05:10, 16 August 2018 by 134.76.223.13: Merged Linclust and MMseqs2 list items; removed one older reference to MMseqs; moved "Virus Orthologous Clusters" from "clustering tools or packages" to "databases"; removed MulitNetClust because (1) it is not a sequence clustering tool, i.e., it does not perform sequence comparisons itself, and (2) it accrued only 3 Google citations in the 8 years since 2010, two of which are self-citations. Tag: references removed

Latest revision as of 17:01, 18 July 2025 by Citation bot: Added article-number. Removed parameters. Some additions/deletions were parameter name changes.
(26 intermediate revisions by 17 users not shown)
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are somehow related. The sequences can be either of [[genomic]], "[[transcriptome|transcriptomic]]" ([[expressed sequence tag|ESTs]]) or [[protein]] origin. For proteins, [[homologous sequence]]s are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].

Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched then it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
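The two strategies described above can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used by UCLUST, CD-HIT or any other package: the function names <code>identity</code>, <code>greedy_cluster</code> and <code>single_linkage_cluster</code> are hypothetical, and the similarity measure is a toy position-wise identity rather than an alignment score. Real tools also rely on fast pre-filters and on a chosen processing order, neither of which is modelled here.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length.

    Assumption: stands in for an alignment-based score so that the
    example stays self-contained.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float,
                   sim: Callable[[str, str], float] = identity) -> List[List[str]]:
    """Greedy clustering: compare each sequence to cluster representatives only."""
    clusters: List[List[str]] = []  # first element of each cluster is its representative
    for s in seqs:
        for cluster in clusters:
            if sim(s, cluster[0]) >= threshold:
                cluster.append(s)  # similar enough to the representative
                break
        else:
            clusters.append([s])  # unmatched: becomes a new representative
    return clusters


def single_linkage_cluster(seqs: List[str], threshold: float,
                           sim: Callable[[str, str], float] = identity) -> List[List[str]]:
    """Single linkage: transitive closure of all pairs at or above the threshold."""
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups: Dict[int, List[str]] = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


if __name__ == "__main__":
    example = ["AAAAAAAA", "AAAAAATT", "AAAATTTT", "GGGGGGGG"]
    print(greedy_cluster(example, 0.75))
    print(single_linkage_cluster(example, 0.75))
```

With the four example sequences and a threshold of 0.75, the greedy variant keeps <code>AAAATTTT</code> in its own cluster because it is only 50% identical to the representative <code>AAAAAAAA</code>, whereas single linkage merges it through the intermediate <code>AAAAAATT</code>. The greedy variant compares each sequence only against representatives, while this single-linkage sketch compares all pairs.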
== Sequence clustering algorithms and packages ==
{{directory|date=September 2018}}
* CD-HIT<ref name=cdhit/>
* [[UCLUST]] in USEARCH<ref name=usearch/>
* Starcode:<ref>{{cite web|url=https://github.com/gui11aume/starcode|title=Starcode repository|website=[[GitHub]]|date=2018-10-11}}</ref> a fast sequence clustering algorithm based on exact all-pairs search.<ref name="pmid25638815">{{cite journal | vauthors = Zorita E, Cuscó P, Filion GJ | title = Starcode: sequence clustering based on all-pairs search | journal = Bioinformatics | volume = 31 | issue = 12 | pages = 1913–9 | date = June 2015 | pmid = 25638815 | pmc = 4765884 | doi = 10.1093/bioinformatics/btv053 }}</ref>
* OrthoFinder:<ref>{{cite web|url=http://www.stevekellylab.com/software/orthofinder|title=OrthoFinder|work=Steve Kelly Lab}}</ref> a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)<ref name="pmid26243257">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy | journal = Genome Biology | volume = 16 | article-number = 157 | date = August 2015 | issue = 1 | pmid = 26243257 | pmc = 4531804 | doi = 10.1186/s13059-015-0721-2 | doi-access = free }}</ref><ref name="pmid31727128">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: phylogenetic orthology inference for comparative genomics | journal = Genome Biology | volume = 20 | issue = 1 | article-number = 238 | date = November 2019 | pmid = 31727128 | pmc = 6857279 | doi = 10.1186/s13059-019-1832-y | doi-access = free }}</ref>
* Linclust:<ref name="pmid29959318">{{cite journal | vauthors = Steinegger M, Söding J | title = Clustering huge protein sequence sets in linear time | journal = Nature Communications | volume = 9 | issue = 1 | article-number = 2542 | date = June 2018 | pmid = 29959318 | pmc = 6026198 | doi = 10.1038/s41467-018-04964-5 | bibcode = 2018NatCo...9.2542S }}</ref> first algorithm whose runtime scales linearly with input set size, very fast, part of [http://mmseqs.org/ MMseqs2]<ref name="pmid29035372">{{cite journal | vauthors = Steinegger M, Söding J | title = MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets | journal = Nature Biotechnology | volume = 35 | issue = 11 | pages = 1026–1028 | date = November 2017 | pmid = 29035372 | doi = 10.1038/nbt.3988 | hdl = 11858/00-001M-0000-002E-1967-3 | s2cid = 402352 | hdl-access = free }}</ref> software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups<ref name="pmid11917018">{{cite journal | vauthors = Enright AJ, Van Dongen S, Ouzounis CA | title = An efficient algorithm for large-scale detection of protein families | journal = Nucleic Acids Research | volume = 30 | issue = 7 | pages = 1575–84 | date = April 2002 | pmid = 11917018 | pmc = 101833 | doi = 10.1093/nar/30.7.1575 }}</ref>
* BAG: a graph theoretic sequence clustering algorithm<ref>{{cite web |url=http://bio.informatics.indiana.edu/sunkim/BAG/ |title=Archived copy |access-date=2004-02-19 |url-status=dead |archive-url=https://web.archive.org/web/20031206172749/http://bio.informatics.indiana.edu/sunkim/BAG/ |archive-date=2003-12-06 }}</ref>
* JESAM:<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/jesam/jesam_paper.html|title=Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters|work=littlest.co.uk}}</ref> Open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster:<ref>{{cite web |url=http://ratest.eng.uiowa.edu/pubsoft/clustering/ |title=pedretti@eyeball -- Clustering Page |website=ratest.eng.uiowa.edu |url-status=dead |archive-url=https://web.archive.org/web/20050409134817/http://ratest.eng.uiowa.edu/pubsoft/clustering/ |archive-date=2005-04-09}}</ref> Parallel Clustering of EST (Gene) Sequences
* BLASTClust: single-linkage clustering with BLAST<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html|title=NCBI News: Spring 2004-BLASTLab|work=nih.gov}}</ref>
* Clusterer:<ref>{{cite web|url=http://bugaco.com/bioinf/clusterer/|title=Clusterer: extendable java application for sequence grouping and cluster analyses|work=bugaco.com}}</ref> extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb:<ref>{{Cite web | url=http://blast.wustl.edu/pub/nrdb/ | title=Index of /pub/nrdb| archive-url=https://web.archive.org/web/20080101032917/http://blast.wustl.edu/pub/nrdb/| archive-date=2008-01-01}}</ref> a program for merging trivially redundant (identical) sequences
* CluSTr:<ref>{{cite web |url=http://www.ebi.ac.uk/clustr/ |title=CluSTr |access-date=2006-11-23 |url-status=dead |archive-url=https://web.archive.org/web/20060924012903/http://www.ebi.ac.uk/clustr/ |archive-date=2006-09-24 }}</ref> A single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
* ICAtools<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/icatools/|title=Introduction to the ICAtools|work=littlest.co.uk}}</ref> - original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant EMBOSS tool<ref>{{cite web|url=http://bioweb2.pasteur.fr/docs/EMBOSS/skipredundant.html|title=EMBOSS: skipredundant|work=pasteur.fr}}</ref> to remove redundant sequences from a set
* CLUSS Algorithm<ref name="pmid17683581">{{cite journal | vauthors = Kelil A, Wang S, Brzezinski R, Fleury A | title = CLUSS: clustering of protein sequences based on a new similarity measure | journal = BMC Bioinformatics | volume = 8 | article-number = 286 | date = August 2007 | pmid = 17683581 | pmc = 1976428 | doi = 10.1186/1471-2105-8-286 | doi-access = free }}</ref> to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver<ref name="prospectus.usherbrooke.ca">{{Cite web | url=http://prospectus.usherbrooke.ca/CLUSS/ | title=CLUSS Home Page}}</ref>
* CLUSS2 Algorithm<ref name="pmid20058485">{{cite journal | vauthors = Kelil A, Wang S, Brzezinski R | title = CLUSS2: an alignment-independent algorithm for clustering protein families with multiple biological functions | journal = International Journal of Computational Biology and Drug Design | volume = 1 | issue = 2 | pages = 122–40 | date = 2008 | pmid = 20058485 | doi = 10.1504/ijcbdd.2008.020190 }}</ref> for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver<ref name="prospectus.usherbrooke.ca"/>
<!-- Lets try the above (although both are wobbly) -->
<!-- * [http://bio.cc/RSDB RSDB] broken link -->

== Non-redundant sequence databases ==
* PISCES: A Protein Sequence Culling Server<ref>{{cite web|url=http://dunbrack.fccc.edu/pisces/|title=Dunbrack Lab|work=fccc.edu}}</ref>
* RDB90<ref name=rdb90>{{cite journal | vauthors = Holm L, Sander C | title = Removing near-neighbour redundancy from large protein sequence collections | journal = Bioinformatics | volume = 14 | issue = 5 | pages = 423–9 | date = June 1998 | pmid = 9682055 | doi = 10.1093/bioinformatics/14.5.423 | doi-access = free }}</ref>
* UniRef: A non-redundant [[UniProt]] sequence database<ref>{{cite web|url=https://www.uniprot.org/database/DBDescription.shtml#uniref|title=About UniProt|work=uniprot.org}}</ref>
* Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity.<ref name="pmid27899574">{{cite journal | vauthors = Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M | title = Uniclust databases of clustered and deeply annotated protein sequences and alignments | journal = Nucleic Acids Research | volume = 45 | issue = D1 | pages = D170–D176 | date = January 2017 | pmid = 27899574 | pmc = 5614098 | doi = 10.1093/nar/gkw1081 }}</ref>
* Virus Orthologous Clusters:<ref>{{cite web|url=http://athena.bioc.uvic.ca/tools/VOCS|title=VOCS - Viral Bioinformatics Resource Center|work=uvic.ca}}</ref> A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity

==See also==
* [[Cluster analysis]]
* [[Social sequence analysis]]

==References==