Sequence clustering: Difference between revisions

Line 7:
 
== Sequence clustering algorithms and packages ==
{{directory|date=September 2018}}
* CD-HIT<ref name=cdhit/>
* [[UCLUST]] in USEARCH<ref name=usearch/>
Line 24:
|pmid=26243257
|doi=10.1186/s13059-015-0721-2 |pmc=4531804}}</ref>
* Linclust<ref>{{cite journal
|title=Clustering huge protein sequence sets in linear time
|author1=Steinegger M. |author2=Söding J. |journal=Nature Communications
Line 30:
|pages=2542
|doi=10.1038/s41467-018-04964-5
|pmid=29959318}}</ref>: the first algorithm whose runtime scales linearly with the size of the input set; part of [http://mmseqs.org/ MMseqs2]<ref>{{cite journal
|title=MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
|author1=Steinegger M. |author2=Söding J. |journal=Nature Biotechnology
Line 55:
* ICAtools<ref>{{cite web|url=http://www.littlest.co.uk/software/bioinf/old_packages/icatools/|title=Introduction to the ICAtools|work=littlest.co.uk}}</ref>: an early DNA clustering package with many algorithms useful for artifact discovery or EST clustering
* Skipredundant EMBOSS tool<ref>{{cite web|url=http://bioweb2.pasteur.fr/docs/EMBOSS/skipredundant.html|title=EMBOSS: skipredundant|work=pasteur.fr}}</ref> to remove redundant sequences from a set
* CLUSS algorithm<ref>{{cite web|url=https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-286|title=CLUSS Algorithm: Clustering non-alignable protein sequences|work=BMC Bioinformatics}}</ref> to identify groups of structurally, functionally, or evolutionarily related hard-to-align protein sequences. CLUSS webserver<ref name="prospectus.usherbrooke.ca">http://prospectus.usherbrooke.ca/CLUSS/</ref>
* CLUSS2 algorithm<ref>{{cite web|url=https://www.inderscienceonline.com/doi/abs/10.1504/IJCBDD.2008.02019|title=CLUSS2: Alignment-independent algorithm for clustering protein families with multiple biological functions|work=www.inderscienceonline.com}}</ref> for clustering families of hard-to-align protein sequences with multiple biological functions. CLUSS2 webserver<ref name="prospectus.usherbrooke.ca"/>
<!-- Let's try the above (although both are wobbly) -->
<!-- * [http://bio.cc/RSDB RSDB] broken link -->