Techniques and applications of early approximate results for big-data analytics
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
The amount of data processed by large-scale data processing frameworks is overwhelming. To improve efficiency, such frameworks employ data parallelization and similar techniques. However, expectations are growing: near real-time data analysis is desired. MapReduce is one of the most common large-scale data processing models in the area. Due to the batch processing nature of this framework, results are returned only after job execution has finished. With the growth of data, a batch operating environment is not always preferred: a large number of applications can take advantage of early approximate results. This need was first addressed by the online aggregation technique, applied to relational databases. Recently it has been adapted for the MapReduce programming model, but with a focus on technical rather than data processing details. In this thesis project we survey techniques that can enable early estimation of results. We propose several modifications of the MapReduce Online framework. We show that our proposed system design changes possess the properties required for accurate estimation of results. We present an algorithm for data bias reduction and block-level sampling. Subsequently, we describe the implementation of our proposed system design and evaluate it with a number of selected applications and datasets. With our system, a user can calculate the average temperature of a 100 GB weather dataset six times faster (in comparison to the complete job execution) with an error as low as 2%.
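The idea of an early approximate result can be illustrated with a minimal, generic online-aggregation sketch. It is not the algorithm or implementation from the thesis. It assumes equal-sized in-memory blocks, visits them in random order, and reports a running mean together with a normal-approximation confidence half-width computed from the block means. The function name `early_mean_estimates` and its parameters `z` and `seed` are hypothetical.

```python
import math
import random


def early_mean_estimates(blocks, z=1.96, seed=0):
    """Yield (blocks_seen, estimate, half_width) after each processed block.

    Assumptions (illustrative only): all blocks have the same size, so the
    mean of block means equals the overall mean; blocks are visited in a
    random order, so the blocks seen so far form a simple random sample.
    """
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)  # random block order reduces ordering bias

    count, mean, m2 = 0, 0.0, 0.0  # Welford running statistics over block means
    for idx in order:
        block = blocks[idx]
        block_mean = sum(block) / len(block)

        count += 1
        delta = block_mean - mean
        mean += delta / count
        m2 += delta * (block_mean - mean)

        if count > 1:
            # Standard error of the mean of block means (sample variance / count).
            std_err = math.sqrt(m2 / (count - 1) / count)
            half_width = z * std_err
        else:
            half_width = float("inf")  # no variance estimate from a single block

        yield count, mean, half_width


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for a dataset: 50 blocks of 100 readings each.
    data = [[rng.gauss(10.0, 5.0) for _ in range(100)] for _ in range(50)]
    for seen, estimate, half_width in early_mean_estimates(data):
        if seen in (5, 20, 50):
            print(f"blocks={seen:2d} estimate={estimate:.3f} +/- {half_width:.3f}")
```

A caller could stop consuming the generator as soon as the half-width falls below a chosen tolerance, trading accuracy for a shorter running time.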
Place, publisher, year, edition, pages
2013. 74 p.
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-141655
OAI: oai:DiVA.org:kth-141655
DiVA: diva2:697407
Master of Science - Distributed Computing
Vlassov, Vladimir, Associate Professor