kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
WinBro: A Window and Broadcast-based Parallel Streaming Graph Partitioning Framework for Apache Flink
KTH, School of Electrical Engineering and Computer Science (EECS).
2019 (English)Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
Abstract [en]

The past years have shown an increasing demand to process data of various kinds and size in real-time. A common representation for many real-world scenarios is a graph, which shows relations between entities, such as users of social networks or pages on the Internet. These graphs increase in size over time and can easily exceed the capacity of single machines.Graph partitioning is used to divide graphs into multiple subgraphs on different servers. Traditional partitioning techniques work in an offline manner where the whole graph is processed before partitioning. Due to the recently increased demand for real-time analysis, online partitioning algorithms have been introduced. They are able to partition a graph arriving as a stream, also referred to as a streaming graph, without any pre-processing step.The goal of a good graph partitioning algorithm is to maintain the data locality and to balance partitions’ load at the same time. Although different algorithms have proven to achieve both goals for real-world graphs, they often require to maintain a state. However, modern stream processing systems, such as Apache Flink, work with a shared-nothing architecture in a data-parallel manner. Therefore, they do not allow to exchange information along with parallel computations. These systems usually use Hash-based partitioning, that is a fast stateless technique but ignores the graph structure. Hence, it can lead to longer analysis times for streaming applications which could benefit from preserved structures.This work aims to develop a state-sharing parallel streaming graph partitioner for Apache Flink, called WinBro, implementing well-performing partitioning algorithms. In order to do this, existing streaming graph algorithms are studied for possible implementation and then integrated into WinBro.For validation, different experiments were made with real-world graphs. In these experiments, the partitioning quality, and partitioning speed were measured. Moreover, the performance of different streaming applications using WinBro was measured and compared with the default Hash-based partitioning method.Results show that the new partitioner WinBro provides better partitioning quality in terms of data locality and also higher performance for applications with requirements for locality-based input data. Nonetheless, the Hash-based partitioner shows the highest throughput and better performance for data localityagnostic streaming applications.

Abstract [sv]

De senaste åren har det skett en ökande efterfrågan på att bearbeta data av olika sorter och storlek i realtid. En vanlig representation för många scenarier är diagram som visar relationer mellan enheter, till exempel användare av sociala nätverk eller sidor på Internet. Dessa grafers storlek ökar över tiden och kan enkelt överstiga kapaciteten hos individuella maskiner.Grafpartitionering används för att dividera grafer i flera delgrafer på olika servrar. Traditionella partitioneringstekniker fungerar offline, där hela grafen bearbetas före partitionering. Baserat på den nyligen ökade efterfrågan på realtidsanalys har online-partitionsalgoritmer introducerats. De kan partitionera en graf som kommer strömmande, även kallad ett strömmande diagram, utan förbehandling.Målet med en bra grafpartitioneringsalgoritm är att behålla datalokalitet och balansera partitionernas belastning samtidigt. Även om olika algoritmer har visat möjligheten att uppnå båda målen för realvärldsgrafik, behöver de ofta behålla ett tillstånd. Moderna strömbehandlingssystem, som Apache Flink, arbetar emellertid med en gemensam-ingenting-arkitektur på ett data-parallellt sätt. Därför tillåter de inte att utbyta information under parallella beräkningar. Dessa system brukar använda Hash-baserad partitionering, vilket är en snabb tillståndslös teknik men ignorerar grafstrukturen. Därför kan det leda till längre analystider för strömmande applikationer som kan dra nytta av bevarade strukturer.Detta arbete har som mål till att utveckla en tillstånsdsdelande, parallellströmmande grafpartitionering för Apache Flink, kallad WinBro, som implementerar välpresterande partitioneringsalgoritmer. För att nå målet studeras befintliga strömmande grafalgoritmer för möjlig implementering och sedan integreras i WinBro.För validering görs olika experiment med realvärldsgrafik. I våra experiment mäter vi partitioneringskvaliteten och partitioneringshastigheten. Dessutom kvantifierar vi prestanda för olika strömmande applikationer med WinBro och jämför den med en standard Hash-baserad partitionsmetod.Resultat visar att den nya partitionern WinBro ger bättre partitioneringskvalitet när det gäller datalokalitet och även högre prestanda för applikationer med krav på lokalitetsbaserad inmatningsdata. Alternativt visar den Hashbaserade partitionen den högsta genomströmningen och bättre prestanda för datalokalitets-agnostiska strömmande applikationer.

Place, publisher, year, edition, pages
2019. , p. 72
Series
TRITA-EECS-EX ; 2019:558
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-258419OAI: oai:DiVA.org:kth-258419DiVA, id: diva2:1350001
External cooperation
RISE SICS
Supervisors
Examiners
Available from: 2019-09-10 Created: 2019-09-10 Last updated: 2023-03-06Bibliographically approved

Open Access in DiVA

fulltext(1460 kB)861 downloads
File information
File name FULLTEXT01.pdfFile size 1460 kBChecksum SHA-512
0540ead1494d37ef98330506809153328a168f9e3a6b979ae463f0a6602574984ab2b2beb9eb6f142b5292e74b4d0b8fda7e9aa25620872ee56278929f581cec
Type fulltextMimetype application/pdf

By organisation
School of Electrical Engineering and Computer Science (EECS)
Computer and Information Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 861 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 890 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf