Sequence clustering: Difference between revisions

Merged Linclust and MMseqs2 list items; removed one older reference to MMseqs; moved "Virus Orthologous Clusters" from "clustering tools or packages" to "databases"; removed MultiNetClust because (1) it is not a *sequence* clustering tool, i.e., it does not perform sequence comparisons itself, and (2) it has accrued only 3 Google citations in the 8 years since 2010, two of which are self-citations.
Tag: references removed
Citation bot
Added article-number. Removed parameters. Some additions/deletions were parameter name changes.
 
(26 intermediate revisions by 17 users not shown)
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are somehow related. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]) or [[protein]] origin.
For proteins, [[Homology (biology)|homologous]] sequences are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].
 
Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence is not matched then it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
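
The two strategies described above can be contrasted in a minimal sketch. It is illustrative only and does not reproduce the implementation of UCLUST, CD-HIT or any other tool: the function names are hypothetical, the <code>identity</code> score is a toy position-by-position comparison standing in for an alignment-based similarity, and processing sequences longest-first in the greedy variant is an assumption of this sketch.

```python
from itertools import combinations


def identity(a, b):
    """Toy similarity: fraction of matching positions over the longer length.

    Assumption: this stands in for an alignment-based score.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def single_linkage(seqs, threshold):
    """Transitive closure of all pairs with similarity >= threshold."""
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair that passes the threshold; chains of links merge clusters.
    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


def greedy_clusters(seqs, threshold):
    """Greedy clustering: each sequence is compared with representatives only."""
    clusters = {}  # representative -> members, in order of creation
    # Assumption: longer sequences are considered first as representatives.
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)  # join the first matching cluster
                break
        else:
            clusters[s] = [s]  # no match: s starts a new cluster
    return clusters
```

With the input <code>["AAAAAAAA", "AAAAAACC", "AAAACCCC", "GGGGGGGG"]</code> and a threshold of 0.75, the single-linkage function chains the first three sequences into one cluster, because each is similar enough to its neighbour. The greedy function starts a separate cluster for <code>AAAACCCC</code>, which is only 50% identical to the representative <code>AAAAAAAA</code>.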
 
== Sequence clustering algorithms and packages ==
{{directory|date=September 2018}}
* [[UCLUST]] in USEARCH<ref name=usearch/>
* CD-HIT<ref name=cdhit/>
* Starcode:<ref>{{cite web|url=https://github.com/gui11aume/starcode|title=Starcode repository|website=[[GitHub]]|date=2018-10-11}}</ref> a fast sequence clustering algorithm based on exact all-pairs search.<ref name="pmid25638815">{{cite journal | vauthors = Zorita E, Cuscó P, Filion GJ | title = Starcode: sequence clustering based on all-pairs search | journal = Bioinformatics | volume = 31 | issue = 12 | pages = 1913–9 | date = June 2015 | pmid = 25638815 | pmc = 4765884 | doi = 10.1093/bioinformatics/btv053 }}</ref>
* OrthoFinder:<ref>{{cite web|url=http://www.stevekellylab.com/software/orthofinder|title=OrthoFinder|work=Steve Kelly Lab}}</ref> a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)<ref name="pmid26243257">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy | journal = Genome Biology | volume = 16 | article-number = 157 | date = August 2015 | issue = 1 | pmid = 26243257 | pmc = 4531804 | doi = 10.1186/s13059-015-0721-2 | doi-access = free }}</ref><ref name="pmid31727128">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: phylogenetic orthology inference for comparative genomics | journal = Genome Biology | volume = 20 | issue = 1 | article-number = 238 | date = November 2019 | pmid = 31727128 | pmc = 6857279 | doi = 10.1186/s13059-019-1832-y | doi-access = free }}</ref>
* Linclust:<ref name="pmid29959318">{{cite journal | vauthors = Steinegger M, Söding J | title = Clustering huge protein sequence sets in linear time | journal = Nature Communications | volume = 9 | issue = 1 | article-number = 2542 | date = June 2018 | pmid = 29959318 | pmc = 6026198 | doi = 10.1038/s41467-018-04964-5 | bibcode = 2018NatCo...9.2542S }}</ref> the first algorithm whose runtime scales linearly with input set size; part of the [http://mmseqs.org/ MMseqs2]<ref name="pmid29035372">{{cite journal | vauthors = Steinegger M, Söding J | title = MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets | journal = Nature Biotechnology | volume = 35 | issue = 11 | pages = 1026–1028 | date = November 2017 | pmid = 29035372 | doi = 10.1038/nbt.3988 | hdl = 11858/00-001M-0000-002E-1967-3 | s2cid = 402352 | hdl-access = free }}</ref> software suite for fast, sensitive sequence searching and clustering of large sequence sets
* TribeMCL: a method for clustering proteins into related groups<ref name="pmid11917018">{{cite journal | vauthors = Enright AJ, Van Dongen S, Ouzounis CA | title = An efficient algorithm for large-scale detection of protein families | journal = Nucleic Acids Research | volume = 30 | issue = 7 | pages = 1575–84 | date = April 2002 | pmid = 11917018 | pmc = 101833 | doi = 10.1093/nar/30.7.1575 }}</ref>
* BAG: a graph theoretic sequence clustering algorithm<ref>{{cite web |url=http://bio.informatics.indiana.edu/sunkim/BAG/ |title=Archived copy |access-date=2004-02-19 |url-status=dead |archive-url=https://web.archive.org/web/20031206172749/http://bio.informatics.indiana.edu/sunkim/BAG/ |archive-date=2003-12-06 |df= }}</ref>
* nrdb90.pl<ref name=rdb90/>
* JESAM:<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/jesam/jesam_paper.html|title=Bioinformatics Paper: JESAM: CORBA software components for EST alignments and clusters|work=littlest.co.uk}}</ref> Open source parallel scalable DNA alignment engine with optional clustering software component
* UICluster:<ref>{{cite web |url=http://ratest.eng.uiowa.edu/pubsoft/clustering/ |title=pedretti@eyeball -- Clustering Page |website=ratest.eng.uiowa.edu |url-status=dead |archive-url=https://web.archive.org/web/20050409134817/http://ratest.eng.uiowa.edu/pubsoft/clustering/ |archive-date=2005-04-09}} </ref> Parallel Clustering of EST (Gene) Sequences
* BLASTClust: single-linkage clustering with BLAST<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html|title=NCBI News: Spring 2004-BLASTLab|work=nih.gov}}</ref>
* Clusterer:<ref>{{cite web|url=http://bugaco.com/bioinf/clusterer/|title=Clusterer: extendable java application for sequence grouping and cluster analyses|work=bugaco.com}}</ref> extendable Java application for sequence grouping and cluster analyses
* PATDB: a program for rapidly identifying perfect substrings
* nrdb:<ref>{{Cite web | url=http://blast.wustl.edu/pub/nrdb/ | title=Index of /pub/nrdb| archive-url=https://web.archive.org/web/20080101032917/http://blast.wustl.edu/pub/nrdb/| archive-date=2008-01-01}}</ref> a program for merging trivially redundant (identical) sequences
* CluSTr:<ref>{{cite web |url=http://www.ebi.ac.uk/clustr/ |title=CluSTr |access-date=2006-11-23 |url-status=dead |archive-url=https://web.archive.org/web/20060924012903/http://www.ebi.ac.uk/clustr/ |archive-date=2006-09-24 |df= }}</ref> a single-linkage protein sequence clustering database built from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
* ICAtools:<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/icatools/|title=Introduction to the ICAtools|work=littlest.co.uk}}</ref> original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant: EMBOSS tool<ref>{{cite web|url=http://bioweb2.pasteur.fr/docs/EMBOSS/skipredundant.html|title=EMBOSS: skipredundant|work=pasteur.fr}}</ref> to remove redundant sequences from a set
* CLUSS Algorithm<ref name="pmid17683581">{{cite journal | vauthors = Kelil A, Wang S, Brzezinski R, Fleury A | title = CLUSS: clustering of protein sequences based on a new similarity measure | journal = BMC Bioinformatics | volume = 8 | article-number = 286 | date = August 2007 | pmid = 17683581 | pmc = 1976428 | doi = 10.1186/1471-2105-8-286 | doi-access = free }}</ref> to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver<ref name="prospectus.usherbrooke.ca">{{Cite web | url=http://prospectus.usherbrooke.ca/CLUSS/ | title=CLUSS Home Page}}</ref>
* CLUSS2 Algorithm<ref name="pmid20058485">{{cite journal | vauthors = Kelil A, Wang S, Brzezinski R | title = CLUSS2: an alignment-independent algorithm for clustering protein families with multiple biological functions | journal = International Journal of Computational Biology and Drug Design | volume = 1 | issue = 2 | pages = 122–40 | date = 2008 | pmid = 20058485 | doi = 10.1504/ijcbdd.2008.020190 }}</ref> for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver<ref name="prospectus.usherbrooke.ca"/>
<!-- Lets try the above (although both are wobbly) -->
<!-- * [http://bio.cc/RSDB RSDB] broken link -->
== Non-redundant sequence databases ==
* PISCES: A Protein Sequence Culling Server<ref>{{cite web|url=http://dunbrack.fccc.edu/pisces/|title=Dunbrack Lab|work=fccc.edu}}</ref>
* RDB90<ref name=rdb90>{{cite journal | vauthors = Holm L, Sander C | title = Removing near-neighbour redundancy from large protein sequence collections | journal = Bioinformatics | volume = 14 | issue = 5 | pages = 423–9 | date = June 1998 | pmid = 9682055 | doi = 10.1093/bioinformatics/14.5.423 | doi-access = free }}</ref>
* UniRef: A non-redundant [[UniProt]] sequence database<ref>{{cite web|url=https://www.uniprot.org/database/DBDescription.shtml#uniref|title=About UniProt|work=uniprot.org}}</ref>
* Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity.<ref name="pmid27899574">{{cite journal | vauthors = Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M | title = Uniclust databases of clustered and deeply annotated protein sequences and alignments | journal = Nucleic Acids Research | volume = 45 | issue = D1 | pages = D170–D176 | date = January 2017 | pmid = 27899574 | pmc = 5614098 | doi = 10.1093/nar/gkw1081 }}</ref>
* Virus Orthologous Clusters:<ref>{{cite web|url=http://athena.bioc.uvic.ca/tools/VOCS|title=VOCS - Viral Bioinformatics Resource Center|work=uvic.ca}}</ref> A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
 
==See also==
*[[Cluster analysis]]
*[[Social sequence analysis]]
 
==References==