Revision as of 05:38, 12 December 2020 by Monkbot (Task 18 (cosmetic): eval 26 templates: del empty params (2×); hyphenate params (6×)) → Revision as of 17:11, 28 March 2021 by JCW-CleanerBot (task, replaced: Bioinformatics (Oxford, England) → Bioinformatics (2))
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related in some way. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]) or [[protein]] origin. For proteins, [[homologous sequence]]s are typically grouped into [[protein family|families]]. For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]]. Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence does not match any existing representative, it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
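The two strategies described above can be contrasted with a minimal sketch. It is an illustration under stated assumptions, not the implementation used by UCLUST or CD-HIT: the function names (<code>identity</code>, <code>single_linkage_clusters</code>, <code>greedy_clusters</code>) are hypothetical, and the similarity score is a toy position-by-position identity rather than an alignment-based score.

```python
def identity(a, b):
    """Toy similarity (an assumption): fraction of matching positions.

    Real tools typically derive the score from a sequence alignment.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def single_linkage_clusters(seqs, threshold):
    """Transitive closure of all pairs with identity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


def greedy_clusters(seqs, threshold):
    """Greedy incremental clustering around representative sequences.

    Each sequence joins the first cluster whose representative it matches;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = []  # clusters[k][0] is the representative of cluster k
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


seqs = ["AAAAAAAA", "AAAAAATT", "AAAATTTT", "GGGGGGGG"]
print(single_linkage_clusters(seqs, 0.75))
print(greedy_clusters(seqs, 0.75))
```

With these four toy sequences and a threshold of 0.75, single linkage chains the first three sequences into one cluster because each neighbouring pair meets the threshold, whereas the greedy method keeps the third sequence separate because it is only 50% identical to the first cluster's representative. Taking the first element of each greedy cluster yields a non-redundant set of representatives.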
* CD-HIT<ref name=cdhit/>
* [[UCLUST]] in USEARCH<ref name=usearch/>
* Starcode:<ref>{{cite web|url=https://github.com/gui11aume/starcode|title=Starcode repository|date=2018-10-11}}</ref> a fast sequence clustering algorithm based on exact all-pairs search.<ref name="pmid25638815">{{cite journal | vauthors = Zorita E, Cuscó P, Filion GJ | title = Starcode: sequence clustering based on all-pairs search | journal = Bioinformatics | volume = 31 | issue = 12 | pages = 1913–9 | date = June 2015 | pmid = 25638815 | pmc = 4765884 | doi = 10.1093/bioinformatics/btv053 }}</ref>
* OrthoFinder:<ref>{{cite web|url=http://www.stevekellylab.com/software/orthofinder|title=OrthoFinder|work=Steve Kelly Lab}}</ref> a fast, scalable and accurate method for clustering proteins into gene families (orthogroups)<ref name="pmid26243257">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy | journal = Genome Biology | volume = 16 | pages = 157 | date = August 2015 | pmid = 26243257 | pmc = 4531804 | doi = 10.1186/s13059-015-0721-2 }}</ref><ref name="pmid31727128">{{cite journal | vauthors = Emms DM, Kelly S | title = OrthoFinder: phylogenetic orthology inference for comparative genomics | journal = Genome Biology | volume = 20 | issue = 1 | pages = 238 | date = November 2019 | pmid = 31727128 | pmc = 6857279 | doi = 10.1186/s13059-019-1832-y }}</ref>
* Linclust:<ref name="pmid29959318">{{cite journal | vauthors = Steinegger M, Söding J | title = Clustering huge protein sequence sets in linear time | journal = Nature Communications | volume = 9 | issue = 1 | pages = 2542 | date = June 2018 | pmid = 29959318 | pmc = 6026198 | doi = 10.1038/s41467-018-04964-5 | bibcode = 2018NatCo...9.2542S }}</ref> the first algorithm whose runtime scales linearly with input set size; part of the [http://mmseqs.org/ MMseqs2]<ref name="pmid29035372">{{cite journal | vauthors = Steinegger M, Söding J | title = MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets | journal = Nature Biotechnology | volume = 35 | issue = 11 | pages = 1026–1028 | date = November 2017 | pmid = 29035372 | doi = 10.1038/nbt.3988 | hdl = 11858/00-001M-0000-002E-1967-3 | s2cid = 402352 | hdl-access = free }}</ref> software suite for fast, sensitive sequence searching and clustering of large sequence sets

== Non-redundant sequence databases ==
* PISCES: A Protein Sequence Culling Server<ref>{{cite web|url=http://dunbrack.fccc.edu/pisces/|title=Dunbrack Lab|work=fccc.edu}}</ref>
* RDB90<ref name=rdb90>{{cite journal | vauthors = Holm L, Sander C | title = Removing near-neighbour redundancy from large protein sequence collections | journal = Bioinformatics | volume = 14 | issue = 5 | pages = 423–9 | date = June 1998 | pmid = 9682055 | doi = 10.1093/bioinformatics/14.5.423 | doi-access = free }}</ref>
* UniRef: A non-redundant [[UniProt]] sequence database<ref>{{cite web|url=https://www.uniprot.org/database/DBDescription.shtml#uniref|title=About UniProt|work=uniprot.org}}</ref>
* Uniclust: UniProtKB sequences clustered at 90%, 50% and 30% pairwise sequence identity.<ref name="pmid27899574">{{cite journal | vauthors = Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M | title = Uniclust databases of clustered and deeply annotated protein sequences and alignments | journal = Nucleic Acids Research | volume = 45 | issue = D1 | pages = D170–D176 | date = January 2017 | pmid = 27899574 | pmc = 5614098 | doi = 10.1093/nar/gkw1081 }}</ref>

Sequence clustering: Difference between revisions