{{Short description|Metric of clustering solutions quality}}
[[File:DBCV clustering evaluation.png|thumb|500px|In each graph, an increasing level of noise is introduced to the initial data, which consist of two well-defined semicircles. As the noise, and thus the overlap between the two groups, increases, the value of the DBCV index progressively decreases. Image released under MIT license.<ref name = felsiq>GitHub. FelSiq/DBCV: Fast Density-Based Clustering Validation (DBCV) Python package. https://github.com/FelSiq/DBCV</ref>]]
'''Density-Based Clustering Validation (DBCV)''' is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like [[DBSCAN]], [[Mean shift]], and [[OPTICS]].
This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the [[Silhouette (clustering)|Silhouette coefficient]], [[Davies–Bouldin index]], or [[Calinski–Harabasz index]] often struggle to provide meaningful evaluations.
Unlike traditional validation measures, which often assume compact and well-separated clusters, the DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence.
This metric was introduced in 2014.<ref name="Moulavi">{{Citation
| last1 = Moulavi
| first1 = David
| last2 = Jaskowiak
| first2 = Pablo A.
| last3 = Campello
| first3 = Ricardo J. G. B.
| last4 = Zimek
| first4 = Arthur
| last5 = Sander
| first5 = Jörg
| chapter = Density-Based Clustering Validation
| year = 2014
| title = Proceedings of the 2014 SIAM International Conference on Data Mining
| doi = 10.1137/1.9781611973440.96
| pages = 839–847
| publisher = SIAM
| isbn = 978-1-61197-344-0
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
}}</ref>
The DBCV index has been employed for clustering analysis in bioinformatics,<ref name="Di Giovanni">{{Citation
| last= Di Giovanni
| first= Daniele
| year= 2023
| title= Using machine learning to explore shared genetic pathways and possible endophenotypes in autism spectrum disorder
| journal= Genes
| volume= 14
| issue= 2
| page= 313
| doi = 10.3390/genes14020313
| doi-access= free
| pmid= 36833240
| pmc= 9956345
}}</ref> ecology,<ref name="Poutaraud">{{Citation
| last= Poutaraud
| first= Joachim
| year= 2024
| title= Meta-Embedded Clustering (MEC): A new method for improving clustering quality in unlabeled bird sound datasets
| journal = Ecological Informatics
| volume= 82
| pages = 102687
| publisher = Elsevier
| doi = 10.1016/j.ecoinf.2024.102687
| doi-access= free
}}</ref> techno-economy,<ref name="Shim">{{Citation
| last= Shim
| first= Jaehyun
| year= 2022
| title= Techno-economic analysis of micro-grid system design through climate region clustering
| journal = Energy Conversion and Management
| volume= 274
| pages = 116411
| publisher = Elsevier
| doi = 10.1016/j.enconman.2022.116411
| bibcode= 2022ECM...27416411S
| url = https://www.sciencedirect.com/science/article/abs/pii/S019689042201189X
| url-access= subscription
}}</ref> and health informatics<ref name="Martinez">{{Citation
| last= Martínez
| first= Rubén Yáñez
| year= 2023
| title= Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection
| journal = Information Processing & Management
| volume= 60
| issue= 3
| pages = 103294
| publisher = Elsevier
| doi = 10.1016/j.ipm.2023.103294
| doi-access= free
}}</ref>
<ref>{{cite journal |
author= Chicco D. |
author2= Oneto L. |
author3= Cangelosi D. |
title = DBSCAN and DBCV application to open medical records heterogeneous data for identifying clinically significant clusters of patients with neuroblastoma |
journal = BioData Mining |
volume = 18 |
issue = 40 |
date = 2025 |
pages = 1–17 |
doi = 10.1186/s13040-025-00455-8 |
doi-access=free|
pmc = 12164137 }}</ref>, as well as in numerous other fields.<ref name="Beer">{{cite arXiv |mode=cs2
| last= Beer
| first= Anna
| year= 2025
| title= DISCO: Internal Evaluation of Density-Based Clustering
| class= cs.LG
| eprint = 2503.00127
}}</ref>
<ref name="Veigel">{{Citation
| last= Veigel
| first= Nadja
| year= 2025
| title= Content analysis of multi-annual time series of flood-related Twitter (X) data
| journal = Natural Hazards and Earth System Sciences
| volume= 25
| issue= 2
| pages = 879–891
| publisher = Copernicus Publications, Göttingen, Germany
| doi = 10.5194/nhess-25-879-2025
| doi-access= free
| bibcode= 2025NHESS..25..879V
| url = https://nhess.copernicus.org/articles/25/879/2025/
}}</ref>
== Definition ==
The DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset <math>X = \{x_1, x_2, \dots, x_n\}</math>, a density-based algorithm partitions it into ''k'' clusters <math>\{C_1, C_2, \dots, C_k\}</math>. Each point belongs to a specific cluster, denoted as <math>\text{cluster}(x_i)</math>.
A key concept in the DBCV index is the notion of density-connected paths.<ref>{{Citation
| last = Ester
| first = M.
| title = Density-based Clustering
| journal = Encyclopedia of Database Systems
| pages = 795–799
| editor1-last = Liu
| editor1-first = L.
| doi = 10.1007/978-0-387-39940-9_605
| url = https://doi.org/10.1007/978-0-387-39940-9_605
| url-access= subscription
}}</ref> Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The '''density-based distance''' between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory.
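The minimax idea behind the density-based distance can be sketched in Python as follows. The function name <code>minimax_path_distances</code> and the input matrix <code>reach</code> (pairwise reachability distances, taken here as already computed) are hypothetical. The sketch uses a Floyd–Warshall-style update, which is one simple way to obtain minimax path distances and not necessarily how DBCV implementations proceed.

```python
def minimax_path_distances(reach):
    """Return minimax path distances for a symmetric matrix of pairwise
    reachability distances (hypothetical input, assumed precomputed).

    The distance between i and j is the smallest possible value of the
    largest edge along any path connecting them.
    """
    n = len(reach)
    dist = [row[:] for row in reach]  # copy so the input is left untouched
    for i in range(n):
        dist[i][i] = 0.0
    # Floyd-Warshall-style relaxation with (min, max) instead of (min, +).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(dist[i][k], dist[k][j])
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Four points on a chain: neighbours are 1 apart, the direct 0-3 link costs 3.
reach = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
dist = minimax_path_distances(reach)
print(dist[0][3])  # 1.0: the path 0-1-2-3 never uses an edge longer than 1
```

The cubic running time is acceptable only for small inputs; for larger datasets the same distances can be read off a minimum spanning tree of the reachability graph.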
The DBCV index extends the [[Silhouette (clustering)|Silhouette coefficient]] by redefining cluster cohesion and separation using density-based distances:
* '''Within-cluster density distance''' measures how closely a point is related to other members of its cluster:
* '''Nearest-cluster density distance''' quantifies how far a point is from the closest external cluster:
<math>
b_i = \min_{{C \neq C_{\text{cluster}(x_i)} \atop C \in \{C_1,\dots,C_k\}}}
</math>
Using these measures, the '''DBCV index''' is computed as:
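As a minimal sketch of how a Silhouette-style score could be assembled from such measures, the following Python function makes three assumptions: the within-cluster value of a point is its mean density-based distance to the other members of its cluster, the nearest-cluster value is the smallest mean distance to any other cluster, and the final score averages <code>(b - a) / max(a, b)</code> over all points. These choices mirror the Silhouette coefficient and are illustrative only; they are not necessarily the exact formulation used by the DBCV authors. The function name and the inputs <code>dist</code> and <code>labels</code> are hypothetical.

```python
def silhouette_style_score(dist, labels):
    """Average of (b_i - a_i) / max(a_i, b_i) over all points.

    dist   : symmetric matrix of density-based distances (assumed given)
    labels : cluster label of each point
    """
    n = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i in range(n):
        own = [j for j in clusters[labels[i]] if j != i]
        # Assumption: singletons and single-cluster solutions score 0.
        if not own or len(clusters) < 2:
            scores.append(0.0)
            continue
        # a_i: mean distance to the other members of the point's own cluster.
        a_i = sum(dist[i][j] for j in own) / len(own)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Two tight pairs (distance 1 inside a pair, 4 across pairs).
dist = [
    [0.0, 1.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 1.0],
    [4.0, 4.0, 1.0, 0.0],
]
good = silhouette_style_score(dist, [0, 0, 1, 1])
bad = silhouette_style_score(dist, [0, 1, 0, 1])
print(good, bad)
```

With the matching labels the score is positive, and with the pairs split across clusters it turns negative, which is the qualitative behaviour expected of a Silhouette-style index.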
== Explanation ==
DBCV index values range between −1 and +1:
* +1: Strongly cohesive and well-separated clusters.
* 0: Ambiguous clustering structure.
* −1: Poorly defined clustering structure.
By leveraging density-based distances instead of traditional [[Euclidean distance|Euclidean measures]], the DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.<ref name = Moulavi />
== References ==
*{{Citation
| last1 = Moulavi
| first1 = David
| last2 = Jaskowiak
| first2 = Pablo A.
| last3 = Campello
| first3 = Ricardo J. G. B.
| last4 = Zimek
| first4 = Arthur
| last5 = Sander
| first5 = Jörg
| chapter = Density-based clustering validation
| year = 2014
| title = Proceedings of the 2014 SIAM International Conference on Data Mining
| doi = 10.1137/1.9781611973440.96
| pages = 839–847
| publisher = SIAM
| isbn = 978-1-61197-344-0
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
| doi-access=free
}}
*{{Citation
| last1 = Chicco
| first1 = Davide
| last2 = Sabino
| first2 = Giuseppe
| last3 = Oneto
| first3 = Luca
| last4 = Jurman
| first4 = Giuseppe
| title = The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters
| year = 2025
| journal = PeerJ Computer Science
| doi = 10.7717/peerj-cs.3095
| pages = 1–37
| publisher = PeerJ Inc.
| url = https://doi.org/10.7717/peerj-cs.3095
| doi-access=free
}}
== Implementations ==
* [https://github.com/FelSiq/DBCV Python DBCV Implementation by Felipe Alves Siqueira]
* [https://
== See also ==
== References ==
<references/>
{{Machine learning evaluation metrics}}
[[Category:Cluster analysis]]