PonIC: Using Stratosphere to Speed Up Pig Analytics
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.ORCID iD: 0000-0001-8219-4862
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
Swedish Institute of Computer Science (SICS), Kista, Sweden.
2013 (English). In: Euro-Par 2013 Parallel Processing: 19th International Conference, Aachen, Germany, August 26-30, 2013, Proceedings, Springer Berlin/Heidelberg, 2013, pp. 279-290. Conference paper (Refereed)
Abstract [en]

Pig, a high-level dataflow system built on top of Hadoop MapReduce, has greatly facilitated the implementation of data-intensive applications. By translating scripts into MapReduce jobs, Pig successfully conceals the limitations of Hadoop's inflexible single-input, two-stage pipeline. However, these limitations are still present in the backend, often resulting in inefficient execution.

Stratosphere, a data-parallel computing framework consisting of PACT, an extension of the MapReduce programming model, and the Nephele execution engine, overcomes several limitations of Hadoop MapReduce. In this paper, we argue that Pig can benefit greatly from using Stratosphere as the backend system and gain performance, without any loss of expressiveness.

We have ported Pig on top of Stratosphere and we present a process for translating Pig Latin scripts into PACT programs. Our evaluation shows that Pig Latin scripts can execute on our prototype up to 8 times faster for a certain class of applications.
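To illustrate the single-input limitation the abstract refers to: an equi-join consumes two datasets, so Hadoop must tag both inputs and union them into one map input, whereas a backend with native multi-input operators (such as Stratosphere's Match contract) can consume both sides directly. The following is a toy sketch of such a two-input join in plain Python, with made-up data; it is not code from the paper.

```python
# Toy two-input equi-join: the kind of operation that strains
# MapReduce's single-input pipeline but maps naturally onto a
# multi-input operator such as PACT's Match contract.

def hash_join(left, right):
    """Join two (key, value) datasets on their keys."""
    table = {}
    for key, value in left:          # build a hash table on the left side
        table.setdefault(key, []).append(value)
    return [
        (key, lval, rval)            # emit one tuple per matching pair
        for key, rval in right
        for lval in table.get(key, [])
    ]

users = [(1, "alice"), (2, "bob")]
visits = [(1, "/home"), (1, "/docs"), (3, "/404")]
print(hash_join(users, visits))
# [(1, 'alice', '/home'), (1, 'alice', '/docs')]
```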

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013, pp. 279-290.
Series: Lecture Notes in Computer Science, ISSN 0302-9743; 8097
Keyword [en]
big data, data analytics, programming systems
National Category
Computer Systems
URN: urn:nbn:se:kth:diva-129072
DOI: 10.1007/978-3-642-40047-6_30
ISI: 000341243100030
ScopusID: 2-s2.0-84883160941
ISBN: 978-3-642-40046-9
OAI: diva2:649882
19th International Conference on Parallel Processing, Euro-Par 2013; Aachen, Germany, 26 August - 30 August 2013
Funding: SSF project End-to-End Clouds (E2E Clouds); Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC)
Swedish Foundation for Strategic Research, RIT10-0043

QC 20131017

Available from: 2013-09-19. Created: 2013-09-19. Last updated: 2014-10-03. Bibliographically approved.
In thesis
1. Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations
2014 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from many different sources and can arrive in the system at varying rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system is a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome.

One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data, rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework's runtime system. Requiring the programmer to work explicitly with parallel process creation and placement is not desirable. Thus, optimizations that relieve the programmer from low-level, error-prone instrumentation are essential to achieving good performance.

The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques for building tools for easy and efficient processing of very large data sets. It describes ways to make systems faster by shortening job completion times. Another major goal is to facilitate application development on distributed data-intensive computation platforms and make big data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets.

The thesis provides results from a study of existing optimizations in MapReduce and Hadoop-related systems. The study presents a comparison and classification of existing systems based on their main contribution. It then summarizes the current state of the research field, identifies trends and open issues, and provides our vision on future directions.

Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms:

PonIC, a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit greatly from using Stratosphere as the backend system and gain performance, without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends.
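As a rough illustration of what such a port involves (the operator-to-contract correspondence below is a plausible simplification assumed here, not the actual PonIC translation rules, which the paper describes in full), a translation layer can be pictured as mapping each Pig Latin operator to a PACT second-order function:

```python
# Illustrative sketch only: a plausible correspondence between Pig Latin
# operators and PACT contracts. The real translation works on a plan
# graph and is more involved than a flat lookup.

PIG_TO_PACT = {
    "LOAD": "DataSource",
    "STORE": "DataSink",
    "FILTER": "Map",      # single-input, record-at-a-time
    "FOREACH": "Map",
    "GROUP": "Reduce",    # single-input, key-grouped
    "JOIN": "Match",      # two-input: no tag-and-union workaround needed
    "COGROUP": "CoGroup",
}

def translate(script_ops):
    """Map each Pig operator in a linear script to a PACT contract."""
    return [PIG_TO_PACT[op] for op in script_ops]

print(translate(["LOAD", "FILTER", "GROUP", "STORE"]))
# ['DataSource', 'Map', 'Reduce', 'DataSink']
```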

HOP-S, a system that uses in-memory random sampling to return approximate, yet accurate, query answers. It uses a simple yet efficient implementation of a random sampling technique, which significantly improves the accuracy of online aggregation.
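The core idea behind approximate answers from random samples can be sketched in a few lines (a hedged illustration of the general technique, not HOP-S's actual implementation): estimate an aggregate such as SUM from a small uniform sample and scale it up by the sampling fraction.

```python
import random

# Estimate SUM over a large dataset from a uniform random sample.
# The estimate is unbiased; its error shrinks as the sample grows.

def approximate_sum(data, sample_size, seed=42):
    """Estimate sum(data) by scaling up the sum of a uniform sample."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    sample = rng.sample(data, sample_size)
    return len(data) * sum(sample) / sample_size

data = list(range(100_000))                # true sum = 4_999_950_000
estimate = approximate_sum(data, 5_000)    # read 5% of the data
error = abs(estimate - sum(data)) / sum(data)
print(f"estimate={estimate:.0f}, relative error={error:.2%}")
```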

An optimization that exploits computation redundancy in analysis programs, and m2r2, a system that stores intermediate results and uses plan matching and rewriting to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times.
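The reuse idea can be sketched as follows (a hedged toy model of plan matching and result materialization, with invented operators; it is not m2r2's actual design): canonicalize a query plan into a key, store the result of executing it, and answer a repeated plan from the store instead of re-executing.

```python
# Toy result-reuse layer: cache executed plans keyed by a canonical form.

result_store = {}
executions = []                       # records which plans actually ran

def canonical(plan):
    """Canonical key for a plan: here, just its operator sequence."""
    return tuple(plan)

def execute(plan, data):
    """Run a tiny made-up operator pipeline over a list."""
    executions.append(plan)
    out = data
    for op in plan:
        if op == "evens":
            out = [x for x in out if x % 2 == 0]
        elif op == "double":
            out = [x * 2 for x in out]
    return out

def run(plan, data):
    key = canonical(plan)
    if key not in result_store:       # plan miss: execute and materialize
        result_store[key] = execute(plan, data)
    return result_store[key]          # plan hit: reuse the stored result

data = [1, 2, 3, 4]
print(run(["evens", "double"], data))   # [4, 8] -- executed
print(run(["evens", "double"], data))   # [4, 8] -- reused from the store
print(len(executions))                  # 1
```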

Finally, an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution.
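The kind of asymmetry being exploited can be sketched with connected components (a generic illustration of incremental fixed-point iteration, not the framework's actual API): each superstep reprocesses only the vertices whose label changed, yet the iteration converges to the same fixed point as recomputing every vertex every step.

```python
# Incremental fixed-point iteration: connected components by label
# propagation, where only "active" (recently changed) vertices do work.

def components_incremental(edges, vertices):
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label = {v: v for v in vertices}       # start: each vertex is its own id
    active = set(vertices)
    while active:                          # fixed point: no label changes
        changed = set()
        for v in active:
            for n in neighbours[v]:
                if label[v] < label[n]:    # propagate the smaller label
                    label[n] = label[v]
                    changed.add(n)
        active = changed                   # asymmetry: only changed vertices
    return label                           # each vertex -> min id in component

edges = [(1, 2), (2, 3), (5, 6)]
print(components_incremental(edges, [1, 2, 3, 5, 6]))
# {1: 1, 2: 1, 3: 1, 5: 5, 6: 5}
```

Labels only ever decrease and are bounded below, so the loop terminates; the final labels are independent of the order in which active vertices are processed.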

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014. 37 p.
Series: TRITA-ICT-ECS AVH, ISSN 1653-6363; 14:11
Keyword [en]
performance optimization, data-intensive computing, big data
National Category
Engineering and Technology
Research subject
Information and Communication Technology
URN: urn:nbn:se:kth:diva-145329
ISBN: 978-91-7595-143-0
2014-06-11, Sal D, KTH - ICT, Isafjordsgatan 39, Kista, 10:00 (English)

QC 20140605

Available from: 2014-06-05. Created: 2014-05-16. Last updated: 2014-06-05. Bibliographically approved.

Authors: Kalavri, Vasiliki; Vlassov, Vladimir (Software and Computer systems, SCS)