Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before. Although it is easy to appreciate that all these data exist, using them to turn a profit is not trivial. Data mining techniques able to extract profitable insights are the next frontier of innovation, competition and profit.
To scale well and grow its profit exponentially, a data analytics services provider has to deal with scalability, multi-tenancy and self-adaptability. In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or of its configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention.
In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner.
For our experiments, we considered k-means as the clustering algorithm; it is a simple but popular algorithm, widely used in many data mining applications. Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique, AdjustedRand. We then perform a correlation t-test in order to validate and benchmark our internal techniques against AdjustedRand.
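As an illustration of the external evaluation mentioned above, the following is a minimal sketch of the Adjusted Rand index in plain Python. It is not taken from TunUp: the function name `adjusted_rand` and the use of two equal-length label lists as input are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points.

    Hypothetical helper for illustration only; not the TunUp implementation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same points")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one point: the partitions trivially agree

    # Contingency table and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of point pairs that share a cell, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)       # largest attainable agreement

    if max_index == expected:
        # Degenerate case: both partitions are a single cluster or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The score is invariant to how clusters are named, so a clustering that recovers the reference grouping under different labels still scores 1, while values near 0 indicate agreement no better than chance.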
Once the best evaluation criteria are defined, the main challenge of k-means is setting the right value of k, which represents the number of clusters, and choosing the distance measure used to compute the distance between each pair of points in the data space. To address this problem we propose an implementation of a genetic evolutionary algorithm that heuristically finds an optimal configuration of our clustering algorithm. To improve performance, we implemented a parallel version of the genetic algorithm by developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) service.
In conclusion, this research contributes TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, which we built and analysed. Our experiments show its quality and efficiency on k-means for a set of public datasets. The research also provides a roadmap that indicates how the current system should be extended and used in future clustering applications.
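The search described above can be sketched as a small genetic algorithm over (k, distance measure) pairs. This is a hedged illustration, not the TunUp implementation: the candidate distance names, the range of k, the operators, and `toy_fitness` are all assumptions. In a real setting the fitness function would run k-means with the given configuration and return an internal evaluation score.

```python
import random

# Hypothetical search space; the thesis does not list these values here.
DISTANCES = ["euclidean", "manhattan", "cosine"]
K_RANGE = (2, 20)


def random_config(rng):
    """Draw a random (k, distance) configuration."""
    return (rng.randint(*K_RANGE), rng.choice(DISTANCES))


def crossover(a, b, rng):
    """Combine k from one parent with the distance of the other."""
    return (a[0], b[1]) if rng.random() < 0.5 else (b[0], a[1])


def mutate(cfg, rng, rate=0.2):
    """Randomly nudge k and/or swap the distance measure."""
    k, dist = cfg
    if rng.random() < rate:
        k = min(K_RANGE[1], max(K_RANGE[0], k + rng.choice([-2, -1, 1, 2])))
    if rng.random() < rate:
        dist = rng.choice(DISTANCES)
    return (k, dist)


def evolve(fitness, generations=30, pop_size=12, elite=2, seed=0):
    """Return the fittest configuration found; higher fitness is better."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        # Elitism: the best configurations survive unchanged.
        population = ranked[:elite] + children
    return max(population, key=fitness)


def toy_fitness(cfg):
    """Placeholder score (an assumption); stands in for a clustering evaluation."""
    k, dist = cfg
    return -abs(k - 5) + (1 if dist == "euclidean" else 0)


best_k, best_distance = evolve(toy_fitness)
```

Because each fitness evaluation is independent of the others within a generation, the scoring step is the natural part to distribute across several workers.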
2013. 110 p.