Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
The size of global electronic data in need of storage and retrieval is growing with an increasing rate. As a result of this growth, the development of technologies to process such data is a necessity. The data is developing in both complexity and connectivity, particularly for social networks. Connectivity of data means that the records to be stored are highly interdependent. Conventional relational databases are poorly suited for processing highly connected data. On the contrary, graph databases are inherently suited for such dependencies. This is mainly due to the fact that graph databases try to preserve locality and store adjacent records close to one another. This allows retrieval of adjacent elements, regardless of graph size, in constant time. Furthermore, with the everyday growth of data volume these databases won’t fit on single server any longer and need more (distributed) resources. Thus, partitioning of the data to be stored is required. Graph partitioning, based on its aim, can be divided into two major subcategories; a) Balanced partitioning where the aim is to find a predefined, N, number of equally sized clusters and b) Community detection where the aim is to find all underlying dense subgraphs. In this thesis we investigate and improve one particular graph partitioning algorithm, namely DiDiC, which is designed for balanced partitioning. DiDiC is short for diffusive and distributed graph partitioning. The algorithm is independently implemented in this thesis. The main testbeds of our work are real-world social network graphs such as Wikipedia or Facebook and synthetically generated graphs. DiDiC's various aspects and performance are further examined in different situations (i.e. types of graph) and using various sets of parameters (i.e. DiDiC hyperparameters). Our experiments indicate that DiDiC fails to partition the input graphs to the desired number of clusters in more than 70% of cases. In most failure cases it returns the whole graph as a single cluster. We noticed that the diffusive aspects of DiDiC is minimally constrained. Inspired by these observations, we propose two diffusive variants of the DiDiC to address this issue and consequently improve the success rate. This is done mainly by constraining the diffusive aspect of DiDiC. The modifications are straightforward to implement and can be easily incorporated into existing graph databases. We show our modifications consistently outperforms DiDiC with a margin of ~30% success rate in various scenarios. The different scenarios include various sizes of graphs, with different connectivity and structure of underlying clusters. Furthermore, we demonstrate the effectiveness 5 of DiDiC in discovering underlying high density regions of graph a.k.a. “community detection”. In fact, we show that it is more successful in “community detection” (60% success rate) than "balanced clustering" (35% success rate). Finally, we investigate the importance of random initialization of DiDiC algorithm. We observe, while different initialization (and keeping the best performing one) can help the final performance, there is a diminishing return when the algorithm is randomly initialized more than 4 times.