Techniques and applications of early approximate results for big-data analytics
KTH, School of Information and Communication Technology (ICT).
2013 (English). Independent thesis, advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The amount of data processed by large-scale data processing frameworks is overwhelming. To improve efficiency, such frameworks employ data parallelization and similar techniques. However, expectations are growing: near real-time data analysis is desired. MapReduce is one of the most common large-scale data processing models in the area. Because of the batch-processing nature of this framework, results are returned only after job execution has finished. As data grows, a batch operating environment is not always preferred: a large number of applications can take advantage of early approximate results. This need was first addressed by the online aggregation technique, applied to relational databases. Recently it has been adapted for the MapReduce programming model, but with a focus on technical rather than data processing details. In this thesis project we survey the techniques that can enable early estimation of results. We propose several modifications of the MapReduce Online framework. We show that our proposed system design changes possess the properties required for accurate result estimation. We present an algorithm for data bias reduction and block-level sampling. We then describe the implementation of our proposed system design and evaluate it with a number of selected applications and datasets. With our system, a user can calculate the average temperature of the 100 GB weather dataset six times faster than with complete job execution, with an error as low as 2%.
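
To illustrate the general idea of early approximate results, the following is a minimal Python sketch of online aggregation for an average: blocks are read in random order, and after each block a running estimate is reported together with a rough 95% confidence half-width. It is not the algorithm from the thesis and does not use MapReduce Online; the function name `early_average`, its parameters, and the normal-approximation interval are assumptions made for illustration only.

```python
import math
import random


def early_average(blocks, seed=0, z=1.96):
    """Yield (fraction_processed, estimate, half_width) after each block.

    Hypothetical helper, not taken from the thesis. Blocks are visited in
    random order, so the records seen so far act as a block-level sample.
    """
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)

    count = 0    # records seen so far
    mean = 0.0   # running mean (Welford's update)
    m2 = 0.0     # running sum of squared deviations from the mean
    for done, idx in enumerate(order, start=1):
        for value in blocks[idx]:
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
        if count > 1:
            variance = m2 / (count - 1)
            # Assumption: records are treated as independent draws. Records
            # inside one block are often correlated, so this interval can be
            # too narrow for real block-level samples.
            half_width = z * math.sqrt(variance / count)
        else:
            half_width = float("inf")
        yield done / len(blocks), mean, half_width
```

A caller could stop consuming the generator once `half_width` falls below an acceptable error, trading accuracy for a shorter running time.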

Place, publisher, year, edition, pages
2013. 74 p.
TRITA-ICT-EX, 2013:185
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-141655
OAI: diva2:697407
Educational program
Master of Science - Distributed Computing
Available from: 2014-02-20. Created: 2014-02-18. Last updated: 2014-02-20. Bibliographically approved.

Open Access in DiVA

fulltext (9427 kB), 109 downloads
File information
File name: FULLTEXT01.pdf
File size: 9427 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf

Total: 109 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 103 hits