Streaming Predictive Analytics on Apache Flink
KTH, School of Information and Communication Technology (ICT).
2015 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Data analysis and predictive analytics today are driven by large-scale distributed deployments of complex pipelines that guide data cleaning, model training and evaluation. A wide range of systems and tools provide the basic abstractions for building such complex pipelines for offline data processing; however, there is an increasing demand for support for incremental models over unbounded streaming data. In this work, we focus on the problem of modelling such a pipeline framework and providing algorithms that build on top of basic abstractions fundamental to stream processing. We design a streaming machine learning pipeline as a series of stages such as model building, concept drift detection and continuous evaluation. As a proof of concept, we build our prototype on Apache Flink, a distributed data processing system with streaming capabilities, along with a state-of-the-art implementation of a variation of the Vertical Hoeffding Tree (VHT), a distributed decision tree classification algorithm.

Furthermore, we compare our version of VHT with the current state-of-the-art implementations on distributed data processing systems in terms of performance and accuracy. Our experimental results on real-world data sets show significant performance benefits of our pipeline while maintaining low classification error. We believe that this pipeline framework can offer a good baseline for a full-fledged implementation of various streaming algorithms that can work in parallel.

Abstract [sv]

Data analysis and predictive analytics are today driven by large-scale distributed deployments of complex pipelines that guide data cleaning, model training and evaluation. A wide range of systems and tools provide only basic abstractions (structure) for building such complex pipelines for offline data processing, but there is an increasing demand for support for incremental models over unbounded streaming data. In this work, we focus on the problem of modelling such a pipeline framework and provide algorithms that build on basic abstractions for stream processing. We construct a streaming machine learning pipeline that contains stages such as model building, concept drift detection and continuous evaluation. We build our prototype on Apache Flink, a distributed data processing system with streaming capabilities, together with the best available implementation of a Vertical Hoeffding Tree (VHT) variant, a distributed decision tree algorithm, as a proof of concept.

Furthermore, we compare our version of VHT with the state of the art in distributed data processing systems in terms of performance and accuracy. Our experimental results show significant benefits of our pipeline while maintaining low classification error. We believe that this framework can offer a good starting point for the implementation of various streaming algorithms that can work in parallel.

Place, publisher, year, edition, pages
2015. 81 p.
Series
TRITA-ICT-EX, 2015:188
Keyword [en]
analytics, streaming
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-171355
OAI: oai:DiVA.org:kth-171355
DiVA: diva2:843219
Subject / course
Information and Communication Technology
Educational program
Master of Science - Distributed Computing
Presentation
2015-07-08, Von Neumann seminar room, SICS, Kistagången 16, Electrum, Stockholm, 18:21 (English)
Supervisors
Examiners
Available from: 2015-07-28. Created: 2015-07-27. Last updated: 2016-05-11. Bibliographically approved.

Open Access in DiVA

fulltext (6614 kB), 2544 downloads
File information
File name: FULLTEXT01.pdf
File size: 6614 kB
Checksum SHA-512: c48115282d3b46de793ab5ab0f83ae74e3d49593d35d29747beb68e5392cbc63b1d40fed90ba599bd0f549c9e6ef767c80657ea41d76f70020647608c12256a8
Type: fulltext
Mimetype: application/pdf

Total: 2544 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 3511 hits