Indexing Genomic Data on Hadoop
KTH, School of Information and Communication Technology (ICT).
2014 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

In recent years, Hadoop has become a standard backend for big data applications. Its best-known application, MapReduce, provides a powerful parallel programming paradigm. Large companies that store petabytes of data, such as Facebook and Yahoo, have deployed their own Hadoop distributions for data analytics, interactive services, and more. Nevertheless, the simplicity of MapReduce's map stage always leads to a full scan of the input data and thus potentially wastes resources.

Recently, new sources of big data have appeared, e.g. the 4K video format and genomic data. Genomic data in its raw file format (FastQ) can take up hundreds of gigabytes per file. Simply using MapReduce for a population analysis would easily result in a full scan of terabytes of data. There is therefore a need for more efficient ways of accessing the data by reducing the amount of data considered for the computation. Existing approaches introduce indexing structures into their respective Hadoop distributions. While some of them are made specifically for certain data structures, e.g. key-value pairs, others depend strongly on the existence of a MapReduce framework. To overcome these problems, we integrated an indexing structure into Hadoop's file system, the Hadoop Distributed File System (HDFS), that works independently of MapReduce. This structure supports user-defined input formats and individual indexing strategies. Index construction is integrated into the file-writing process and is independent of software running in higher layers of Hadoop. As a proof of concept, however, MapReduce has been given the ability to use these indexing structures by simply adding a new parameter to its job definition. A prototype and its evaluation show the advantages of using these structures, with genomic data (FastQ and SAM files) as a use case.
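The idea of building an index while a file is written, and then using it to avoid a full scan, can be illustrated with a minimal Python sketch. This is not the thesis's HDFS implementation: the function names `write_fastq_with_index` and `read_record`, the in-memory buffer, and the choice of indexing byte offsets by read ID are all assumptions made purely for illustration.

```python
import io

def write_fastq_with_index(records, out):
    """Write FASTQ records to `out` and build an index during the write.

    Hypothetical helper: returns {read_id: byte offset of the record start}.
    """
    index = {}
    for read_id, seq, qual in records:
        index[read_id] = out.tell()  # offset recorded before the record is written
        out.write(f"@{read_id}\n{seq}\n+\n{qual}\n".encode("ascii"))
    return index

def read_record(src, index, read_id):
    """Fetch one record by seeking to its indexed offset instead of scanning."""
    src.seek(index[read_id])
    header = src.readline().decode("ascii").rstrip("\n")
    seq = src.readline().decode("ascii").rstrip("\n")
    src.readline()  # skip the '+' separator line
    qual = src.readline().decode("ascii").rstrip("\n")
    return header[1:], seq, qual

# Toy data standing in for a (much larger) FASTQ file.
records = [("r1", "ACGT", "IIII"), ("r2", "GGCA", "HHHH"), ("r3", "TTAG", "JJJJ")]
buf = io.BytesIO()
index = write_fastq_with_index(records, buf)
print(read_record(buf, index, "r2"))
```

In this sketch the lookup touches only the bytes of the requested record. A real system would also need to persist the index and handle block boundaries, which are outside the scope of this illustration.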

Place, publisher, year, edition, pages
2014. 67 p.
TRITA-ICT-EX, 2014:111
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-177298
OAI: diva2:872155
Available from: 2015-12-01
Created: 2015-11-18
Last updated: 2015-12-01
Bibliographically approved
