Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
In recent years Hadoop has become a standard backend for big data applications. Its best-known component, MapReduce, provides a powerful parallel programming paradigm. Companies storing petabytes of data, such as Facebook and Yahoo, have deployed their own Hadoop distributions for data analytics, interactive services, and more. Nevertheless, the simplicity of MapReduce's map stage always results in a full scan of the input data and thus potentially wastes resources.
Recently, new sources of big data have appeared, e.g. the 4K video format and genomic data. Genomic data in its raw file format (FastQ) can reach hundreds of gigabytes per file. Simply using MapReduce for a population analysis would easily result in a full scan of terabytes of data. There is thus a clear need for more efficient data access that reduces the amount of data considered for a computation. Existing approaches introduce indexing structures into their respective Hadoop distributions; some are tailored to specific data structures, e.g. key-value pairs, while others depend strongly on the presence of a MapReduce framework. To overcome these problems, we integrated an indexing structure into Hadoop's file system, the Hadoop Distributed File System (HDFS), that works independently of MapReduce. This structure supports the definition of custom input formats and individual indexing strategies. Index construction is integrated into the file-writing process and is independent of software working in higher layers of Hadoop. As a proof of concept, MapReduce has nevertheless been enabled to use these indexing structures by simply adding a new parameter to its job definition. A prototype and its evaluation demonstrate the advantages of using these structures, with genomic data (FastQ and SAM files) as a use case.
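From a user's perspective, the MapReduce integration described above could look roughly like the following configuration sketch. This is an illustrative assumption only: the configuration keys (`hdfs.index.use`, `hdfs.index.filter`) and the class name are hypothetical placeholders, since the abstract does not name the actual parameter added to the job definition.

```java
// Hypothetical sketch: enabling the HDFS-level index for a MapReduce job.
// The configuration keys and class names are illustrative assumptions,
// not the thesis's actual API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IndexedFastqJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical switch telling the job to consult the HDFS index
        // instead of performing a full scan of the FastQ input.
        conf.set("hdfs.index.use", "true");
        // Hypothetical predicate pushed down to the index so that only
        // matching file regions are handed to the mappers.
        conf.set("hdfs.index.filter", "readId:SRR000001-SRR000999");

        Job job = Job.getInstance(conf, "indexed-fastq-analysis");
        job.setJarByClass(IndexedFastqJob.class);
        // ... mapper, reducer, and input/output paths as in an ordinary job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point of such a design is that an unmodified job definition still runs as a full scan, while a single added parameter opts in to index-assisted input selection.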
2014, 67 p.