Density-based clustering validation: Difference between revisions

Content deleted Content added
RichBot (talk | contribs)
(Beep, Boop). I have removed a template which is not valid in Draftspace
OAbot (talk | contribs)
m Open access bot: pmc updated in citation with #oabot.
 
(33 intermediate revisions by 12 users not shown)
Line 1:
{{Short description|Metric of clustering solutions quality}}
{{Draft topics|stem}}
{{AfC topic|stem}}
{{AfC submission|||ts=20250414153130|u=Giuseppe Sabino|ns=2}}
 
[[File:DBCV clustering evaluation.png|thumb|500px|In each graph, an increasing level of noise is added to the initial data, which consist of two well-defined semicircles. As the noise, and thus the overlap between the two groups, increases, the value of the DBCV index progressively decreases. Image released under MIT license.<ref name = felsiq>GitHub. FelSiq/DBCV: Fast Density-Based Clustering Validation (DBCV) Python package. https://github.com/FelSiq/DBCV</ref>]]
Line 13 ⟶ 10:
Unlike traditional validation measures, which often assume compact and well-separated clusters, the DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence.
 
This metric was introduced in 2014 by Davoud Moulavi and colleagues.<ref name = Moulavi>{{Citation
| last1 = Moulavi
| first1 = Davoud
| last2 = Jaskowiak
| first2 = Pablo A.
| last3 = Campello
| first3 = Ricardo J. G. B.
| last4 = Zimek
| first4 = Arthur
| last5 = Sander
| first5 = Jörg
| chapter = Density-Based Clustering Validation
| year = 2014
| title = Proceedings of the 2014 SIAM International Conference on Data Mining
| doi = 10.1137/1.9781611973440.96
| pages = 839–847
| publisher = SIAM
| isbn = 978-1-61197-344-0
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
}}</ref> It uses density-connectivity principles to quantify clustering structure, making it especially suited to assessing arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable.
 
The DBCV index has been employed for clustering analysis in bioinformatics,<ref name="Di Giovanni">{{Citation
| last= Di Giovanni
| first= Daniele
Line 31 ⟶ 37:
| title= Using machine learning to explore shared genetic pathways and possible endophenotypes in autism spectrum disorder
| journal= Genes
| volume= 14
| issue= 2
| page= 313
| doi = 10.3390/genes14020313
| doi-access= free
| url = https://www.mdpi.com/2073-4425/14/2/313
| pmid= 36833240
| pmc= 9956345
}}</ref> ecology,<ref name="Poutaraud">{{Citation
| last= Poutaraud
| first= Joachim
Line 39 ⟶ 50:
| title= Meta-Embedded Clustering (MEC): A new method for improving clustering quality in unlabeled bird sound datasets
| journal = Ecological Informatics
| volume= 82
| pages = 102687
| publisher = Elsevier
| doi = 10.1016/j.ecoinf.2024.102687
| doi-access= free
| url = https://www.sciencedirect.com/science/article/pii/S1574954124002292
}}</ref> techno-economic analysis,<ref name="Shim">{{Citation
| last= Shim
| first= Jaehyun
Line 49 ⟶ 61:
| title= Techno-economic analysis of micro-grid system design through climate region clustering
| journal = Energy Conversion and Management
| volume= 274
| pages = 116411
| publisher = Elsevier
| doi = 10.1016/j.enconman.2022.116411
| bibcode= 2022ECM...27416411S
| url = https://www.sciencedirect.com/science/article/abs/pii/S019689042201189X
| url-access= subscription
}}</ref> and health informatics<ref name="Martinez">{{Citation
| last= Martínez
| first= Rubén Yáñez
| year= 2023
| title= Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection
| journal = Information Processing & Management
| volume= 60
| issue= 3
| pages = 103294
| publisher = Elsevier
| doi = 10.1016/j.ipm.2023.103294
| doi-access= free
| url = https://www.sciencedirect.com/science/article/pii/S0306457323000316
}}</ref>
<ref>{{cite journal |
author= Chicco D. |
author2= Oneto L. |
author3= Cangelosi D. |
title = DBSCAN and DBCV application to open medical records heterogeneous data for identifying clinically significant clusters of patients with neuroblastoma |
journal = BioData Mining |
volume = 18 |
issue = 40 |
date = 2025 |
pages = 1–17 |
doi = 10.1186/s13040-025-00455-8 |
doi-access=free|
pmc = 12164137 }}</ref>, as well as in numerous other fields.<ref name="Beer">{{cite arXiv |mode=cs2
| last= Beer
| first= Anna
| year= 2025
| title= DISCO: Internal Evaluation of Density-Based Clustering
| class= cs.LG
| eprint = 2503.00127
}}</ref>
<ref name="Veigel">{{Citation
| last= Veigel
| first= Nadja
Line 78 ⟶ 107:
| title= Content analysis of multi-annual time series of flood-related Twitter (X) data
| journal = Natural Hazards and Earth System Sciences
| volume= 25
| issue= 2
| pages = 879–891
| publisher = Copernicus Publications
| location = Göttingen, Germany
| doi = 10.5194/nhess-25-879-2025
| doi-access= free
| bibcode= 2025NHESS..25..879V
| url = https://nhess.copernicus.org/articles/25/879/2025/
}}</ref>
Line 87 ⟶ 120:
The DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset <math>X = \{x_1, x_2, \ldots, x_n\}</math>, a density-based algorithm partitions it into ''K'' clusters <math>\{C_1, C_2, \ldots, C_K\}</math>. Each point <math>x_i</math> belongs to a specific cluster, denoted as <math>\text{Cluster}(x_i)</math>.
 
A key concept in the DBCV index is the notion of density-connected paths.<ref>{{Citation
| last = Ester
| first = M.
Line 93 ⟶ 126:
| title = Density-based Clustering
| journal = Encyclopedia of Database Systems
| pages = 795–799
| editor1-last = Liu
| editor1-first = L.
Line 102 ⟶ 136:
| doi = 10.1007/978-0-387-39940-9_605
| url = https://doi.org/10.1007/978-0-387-39940-9_605
| url-access= subscription
}}</ref> Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The '''density-based distance''' between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory.
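
The "minimize the maximum along the path" idea can be illustrated with a small sketch. It is not taken from any DBCV implementation: the function name <code>minimax_path_distance</code>, the toy positions, and the use of plain absolute differences in place of real local reachability distances are all assumptions made for illustration.

```python
import heapq


def minimax_path_distance(weights, source, target):
    """Smallest achievable 'largest edge weight' over all paths source -> target.

    weights is a symmetric n x n list of lists of pairwise distances
    (standing in here for local reachability distances).
    """
    n = len(weights)
    best = [float("inf")] * n  # best[v]: smallest bottleneck found so far for v
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if u == target:
            return bottleneck
        if bottleneck > best[u]:
            continue  # stale heap entry, a better path to u was already processed
        for v in range(n):
            if v == u:
                continue
            # Cost of a path is its largest edge, so extend with max(), not a sum.
            candidate = max(bottleneck, weights[u][v])
            if candidate < best[v]:
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return best[target]


# Hypothetical toy data: four points on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
weights = [[abs(a - b) for b in positions] for a in positions]

d_02 = minimax_path_distance(weights, 0, 2)  # best path is 0 -> 1 -> 2
d_03 = minimax_path_distance(weights, 0, 3)  # every path must cross the gap before 10
```

In this toy setting the distance from the first to the third point is governed by the unit steps between neighbours rather than by the direct distance of 2, while reaching the isolated fourth point is dominated by the large gap in front of it. This is the behaviour that lets a density-based distance follow elongated or curved clusters.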
 
Line 127 ⟶ 162:
== Explanation ==
 
DBCV index values range between −1 and +1:
 
* +1: Strongly cohesive and well-separated clusters.
* 0: Ambiguous clustering structure.
* −1: Poorly formed clusters or incorrect assignments.
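
The range can be made concrete with a short sketch of how the index is assembled in Moulavi et al.: each cluster receives a validity value that compares its density sparseness (DSC) with its smallest density separation (DSPC) from any other cluster, and the index is the size-weighted average of these values. The function names and the input numbers below are hypothetical, and the two quantities are taken as given rather than computed from data.

```python
def cluster_validity(sparseness, min_separation):
    """Validity of a single cluster; always lies between -1 and +1.

    sparseness     : density sparseness of the cluster (DSC), assumed > 0
    min_separation : smallest density separation (DSPC) to any other cluster,
                     assumed > 0
    """
    return (min_separation - sparseness) / max(min_separation, sparseness)


def dbcv_from_parts(sizes, sparseness, min_separation, n_total=None):
    """Size-weighted average of the per-cluster validity values.

    n_total is the number of objects in the dataset; it defaults to the sum of
    the cluster sizes, i.e. a clustering without noise points.
    """
    if n_total is None:
        n_total = sum(sizes)
    return sum(
        size / n_total * cluster_validity(dsc, dspc)
        for size, dsc, dspc in zip(sizes, sparseness, min_separation)
    )


# Hypothetical values for two clusters of 60 and 40 points.
score = dbcv_from_parts(
    sizes=[60, 40],
    sparseness=[0.2, 0.5],
    min_separation=[1.0, 1.0],
)
```

A cluster whose separation is much larger than its sparseness scores close to +1, and swapping the two quantities flips the sign, which is why values near −1 indicate clusters that are internally sparser than the gaps between them. Passing an <code>n_total</code> larger than the sum of cluster sizes shrinks the score, reflecting that points labelled as noise contribute nothing.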
 
By leveraging density-based distances instead of traditional [[Euclidean distance|Euclidean measures]], the DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.<ref name = Moulavi />
 
== References ==
*{{Citation
| last1 = Moulavi
| first1 = Davoud
| last2 = Jaskowiak
| first2 = Pablo A.
| last3 = Campello
| first3 = Ricardo J. G. B.
| last4 = Zimek
| first4 = Arthur
| last5 = Sander
| first5 = Jörg
| chapter = Density-based clustering validation
| year = 2014
| title = Proceedings of the 2014 SIAM International Conference on Data Mining
| doi = 10.1137/1.9781611973440.96
| pages = 839–847
| publisher = SIAM
| isbn = 978-1-61197-344-0
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
| doi-access=free
}}
 
*{{Citation
| last1 = Chicco
| first1 = Davide
| last2 = Sabino
| first2 = Giuseppe
| last3 = Oneto
| first3 = Luca
| last4 = Jurman
| first4 = Giuseppe
| title = The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters
| year = 2025
| journal = PeerJ Computer Science
| doi = 10.7717/peerj-cs.3095
| pages = 1–37
| publisher = PeerJ Inc.
| url = https://doi.org/10.7717/peerj-cs.3095
| doi-access=free
}}

== Implementations ==
* [https://github.com/christopherjenness/DBCV Python DBCV Implementation by Christopher Jenness]
* [https://github.com/FelSiq/DBCV Python DBCV Implementation by Felipe Alves Siqueira]
* [https://doi.org/10.32614/cran.package.dbcvindex R DBCV Implementation by Pablo Andretta Jaskowiak]
 
== See also ==
Line 154 ⟶ 228:
<references/>
 
{{Machine learning evaluation metrics}}
 
[[Category:Cluster analysis]]