Ground truth clustering is not the optimum clustering
2025 (English). In: Scientific Reports, E-ISSN 2045-2322, Vol. 15, no 1, article id 9223
Article in journal (Refereed) Published
Abstract [en]
Data clustering is a fundamental yet challenging task in data science. The minimum sum-of-squares clustering (MSSC) problem aims to partition data points into k clusters to minimize the sum of squared distances between the points and their cluster centers (centroids). Despite being NP-hard, solvers exist that can compute optimal solutions for small to medium-sized datasets. One such solver is SOS-SDP, a branch-and-bound algorithm based on semidefinite programming. We used it to obtain optimal MSSC solutions (optimum clusterings) for various k across multiple datasets with known ground truth clusterings. We evaluated the alignment between the optimum and ground truth clusterings using six extrinsic measures and assessed their quality using three intrinsic measures. The results reveal that the optimum clusterings often differ significantly from the ground truth clusterings. Additionally, the optimum clusterings frequently outperform the ground truth clusterings, according to the intrinsic measures that we used. However, when ground truth clusters are well-separated convex shapes, such as ellipsoids, the optimum and ground truth clusterings closely align.
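The MSSC objective described in the abstract can be illustrated with a minimal sketch. It is not the SOS-SDP solver and it does not reproduce any data or results from the article. The function name `mssc_objective`, the toy points, and both labelings are hypothetical, chosen only to show how a labeling that ignores geometry can have a much higher sum-of-squares cost than a compact one.

```python
from collections import defaultdict


def mssc_objective(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    # Group the points by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Hypothetical toy data: four one-dimensional points in two obvious groups.
points = [[0.0], [1.0], [10.0], [11.0]]
labels_by_annotation = [0, 1, 0, 1]  # hypothetical "ground truth" that ignores geometry
labels_compact = [0, 0, 1, 1]        # groups nearby points together

cost_by_annotation = mssc_objective(points, labels_by_annotation)
cost_compact = mssc_objective(points, labels_compact)
print(cost_by_annotation, cost_compact)
```

On this toy input the compact labeling has the lower cost, so an exact MSSC solver with k = 2 would not return the annotated labeling.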
Place, publisher, year, edition, pages
Springer Nature, 2025. Vol. 15, no 1, article id 9223
Keywords [en]
Extrinsic measures, Ground truth clustering, Intrinsic measures, Minimum sum-of-squares clustering
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-362013; DOI: 10.1038/s41598-025-90865-9; ISI: 001446949700011; PubMedID: 40097499; Scopus ID: 2-s2.0-105000375355; OAI: oai:DiVA.org:kth-362013; DiVA, id: diva2:1949686
Note
QC 20250425
2025-04-03, 2025-04-03, 2025-04-25; Bibliographically approved