Efficient Data Stream Sampling on Apache Flink
KTH, School of Computer Science and Communication (CSC). SICS.
2016 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Effektiv dataströmsampling med Apache Flink (Swedish)
Abstract [en]

Sampling is considered to be a core component of data analysis, making it possible to provide a synopsis of possibly large amounts of data by maintaining only subsets or multisubsets of it. In the context of data streaming, an emerging processing paradigm where data is assumed to be unbounded, sampling offers great potential since it can establish a representative bounded view of infinite data streams to any streaming operations. This further unlocks several benefits such as sustainable continuous execution on managed memory, trend sensitivity control and adaptive processing tailored to the operations that consume data streams.

The main aim of this thesis is to conduct an experimental study in order to categorize existing sampling techniques over a selection of properties derived from common streaming use cases. For that purpose we designed and implemented a testing framework that allows for configurable sampling policies under different processing scenarios along with a library of different samplers implemented as operators. We build on Apache Flink, a distributed stream processing system, to provide this testbed and all component implementations of this study. Furthermore, we show in our experimental analysis that there is no optimal sampling technique for all operations. Instead, there are different demands across usage scenarios such as online aggregations and incremental machine learning. In principle, we show that each sampling policy trades off bias, sensitivity and concept drift adaptation, properties that can be potentially predefined by different operators.

We believe that this study serves as the starting point towards automated adaptive sampling selection for sustainable continuous analytics pipelines that can react to stream changes and thus offer the right data needed at each time, for any possible operation.
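To make the idea of a bounded view over an unbounded stream concrete, the following is a minimal Python sketch of two well-known sampling policies. It is an illustration only and is not taken from the thesis: the class names `UniformReservoir` and `BiasedReservoir`, the `offer` method, and the `capacity` and `seed` parameters are all hypothetical, and the code does not use any Apache Flink API. The first class is classic reservoir sampling (Algorithm R), which keeps every item seen so far with equal probability. The second is a simple recency-biased reservoir that always admits the newest item, which is one way a policy can trade uniformity for sensitivity to recent data.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class UniformReservoir(Generic[T]):
    """Algorithm R: after n items, each one is in the sample with probability capacity/n."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.seen = 0                      # number of items offered so far
        self.sample: List[T] = []          # bounded view of the stream
        self._rng = random.Random(seed)

    def offer(self, item: T) -> None:
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(item)       # fill phase: keep everything
            return
        j = self._rng.randrange(self.seen) # uniform index in [0, seen)
        if j < self.capacity:
            self.sample[j] = item          # replace with probability capacity/seen


class BiasedReservoir(Generic[T]):
    """Recency-biased reservoir: every new item is admitted, so old items decay over time."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.seen = 0
        self.sample: List[T] = []
        self._rng = random.Random(seed)

    def offer(self, item: T) -> None:
        self.seen += 1
        fill = len(self.sample) / self.capacity
        if self._rng.random() < fill:
            # Overwrite a random resident; always taken once the reservoir is full.
            self.sample[self._rng.randrange(len(self.sample))] = item
        else:
            self.sample.append(item)       # grow while there is free space
```

Feeding the same stream into both samplers keeps memory fixed at `capacity` items in each case, but the uniform reservoir reflects the whole history while the biased one concentrates on recent items. Which behaviour is preferable depends on the consuming operation, which is the kind of trade-off the abstract describes.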

Keyword [en]
Sampling, Streaming, Apache Flink, Distributed Systems
National Category
Computer Science
URN: urn:nbn:se:kth:diva-183397
OAI: diva2:910695
Educational program
Master of Science - Machine Learning
Available from: 2016-03-14 Created: 2016-03-09 Last updated: 2016-03-14
Bibliographically approved

Open Access in DiVA

fulltext (2927 kB), 196 downloads
File information
File name: FULLTEXT01.pdf
File size: 2927 kB
Checksum SHA-512
Type: fulltext
Mimetype: application/pdf
