Viral Clustering: A Robust Method to Extract Structures in Heterogeneous Datasets
2016 (English)Conference paper (Refereed)
Cluster validation constitutes one of the most challenging problems in unsupervised cluster analysis. For example, identifying the true number of clusters present in a dataset has been investigated for decades, and is still puzzling researchers today. The difficulty stems from the high variety of the dataset characteristics. Some datasets exhibit a strong structure with a few well-separated and normally distributed clusters, but most often real-world datasets contain possibly many overlapping non-gaussian clusters with heterogeneous variances and shapes. This calls for the design of robust clustering algorithms that could adapt to the structure of the data and in particular accurately guess the true number of clusters. They have recently been interesting attempts to design such algorithms, e.g. based on involved non-parametric statistical inference techniques. In this paper, we develop Viral Clustering (VC), a simple algorithm that jointly estimates the number of clusters and outputs clusters. The VC algorithm relies on two antagonist and interacting components. The first component tends to regroup neighbouring samples together, while the second component tends to spread samples in various clusters. This spreading component is performed using an analogy with the way virus spread over networks. We present extensive numerical experiments illustrating the robustness of the VC algorithm, and its superiority compared to existing algorithms.
Place, publisher, year, edition, pages
Association for the Advancement of Artificial Intelligence , 2016.
Clustering, K-means, Cluster Validation, Number of Clusters
Research subject Mathematics
IdentifiersURN: urn:nbn:se:kth:diva-181109OAI: oai:DiVA.org:kth-181109DiVA: diva2:898710
The Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), February 12-17, Phoenix, USA
QC 201603232016-01-292016-01-292016-03-23Bibliographically approved