Block Sampling: Efficient Accurate Online Aggregation in MapReduce
2013 (English)
In: Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, 2013, 250-257 p.
Conference paper (Refereed)
Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, preventing interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user, prior to job completion. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results, by partially executing jobs on subsets of the input, using a simplistic progress metric. Due to its sequential nature, values are not objectively represented in the input subset, often resulting in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing, which can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results, while only introducing minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
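The contrast the abstract draws between sequential partial execution and uniform block sampling can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: plain Python lists stand in for HDFS blocks, no HOP or Hadoop API is used, and the function names (split_into_blocks, sequential_prefix, block_sample, demo) and the synthetic sorted dataset are hypothetical.

```python
import random


def split_into_blocks(records, block_size):
    """Cut a list of records into fixed-size blocks (stand-in for HDFS blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def sequential_prefix(blocks, fraction):
    """Read the first `fraction` of the blocks in storage order."""
    k = max(1, int(len(blocks) * fraction))
    return [r for block in blocks[:k] for r in block]


def block_sample(blocks, fraction, rng):
    """Pick `fraction` of the blocks uniformly at random, without replacement,
    then shuffle the selected records in memory."""
    k = max(1, int(len(blocks) * fraction))
    chosen = rng.sample(range(len(blocks)), k)
    records = [r for idx in chosen for r in blocks[idx]]
    rng.shuffle(records)
    return records


def mean(values):
    return sum(values) / len(values)


def demo(seed=0):
    """Compare both strategies on a hypothetical dataset stored in sorted order."""
    rng = random.Random(seed)
    # Assumption: values rise steadily through the file, so any prefix is biased.
    records = [i / 100.0 for i in range(10_000)]
    blocks = split_into_blocks(records, 100)
    true_mean = mean(records)
    seq_err = abs(mean(sequential_prefix(blocks, 0.1)) - true_mean)
    samp_err = abs(mean(block_sample(blocks, 0.1, rng)) - true_mean)
    return true_mean, seq_err, samp_err


if __name__ == "__main__":
    true_mean, seq_err, samp_err = demo()
    print(f"true mean:                 {true_mean:.3f}")
    print(f"sequential 10% abs. error: {seq_err:.3f}")
    print(f"block-sampled 10% error:   {samp_err:.3f}")
```

On this deliberately sorted input, the sequential prefix sees only the smallest values, whereas the sampled blocks are spread over the whole file. The in-memory shuffle does not change the mean of the selected records; it only removes their storage order, which matters when records are consumed incrementally to produce early estimates.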
Place, publisher, year, edition, pages: 2013. 250-257 p.
Keywords: MapReduce, online aggregation, sampling, approximate results
National Category: Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-145332
DOI: 10.1109/CloudCom.2013.40
ISI: 000352075800035
ScopusID: 2-s2.0-84899738915
OAI: oai:DiVA.org:kth-145332
DiVA: diva2:717730
5th IEEE International Conference on Cloud Computing Technology and Science (CloudCom) 2013
QC 20140624
2014-05-16, 2014-05-16, 2014-06-24
Bibliographically approved