kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Scalable Streaming Graph and Time Series Analysis Using Partitioning and Machine Learning
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.ORCID iD: 0000-0001-6171-9586
2021 (English)Doctoral thesis, monograph (Other academic)
Abstract [en]

Recent years have witnessed a massive increase in the amount of data generated by the Internet of Things (IoT) and social media. Processing huge amounts of this data poses non-trivial challenges in terms of the hardware and performance requirements of modern-day applications. The data we are dealing with today is of massive scale, high intensity and comes in various forms. MapReduce was a popular and clever choice of handling big data using a distributed programming model, which made the processing of huge volumes of data possible using clusters of commodity machines. However, MapReduce was not a good fit for performing complex tasks, such as graph processing, iterative programs and machine learning. Modern data processing frameworks, that are being popularly used to process complex data and perform complex analysis tasks, overcome the shortcomings of MapReduce. Some of these popular frameworks include Apache Spark for batch and stream processing, Apache Flink for stream processing and Tensor Flow for machine learning.

In this thesis, we deal with complex analytics on data modeled as time series, graphs and streams. Time series are commonly used to represent temporal data generated by IoT sensors. Analysing and forecasting time series, i.e. extracting useful characteristics and statistics of data and predicting data, is useful for many fields that include, neuro-physiology, economics, environmental studies, transportation, etc. Another useful data representation we work with, are graphs. Graphs are complex data structures used to represent relational data in the form of vertices and edges. Graphs are present in various application domains, such as recommendation systems, road traffic analytics, web analysis, social media analysis. Due to the increasing size of graph data, a single machine is often not sufficient to process the complete graph. Therefore, the computation, as well as the data, must be distributed. Graph partitioning, the process of dividing graphs into subgraphs, is an essential step in distributed graph processing of large scale graphs because it enables parallel and distributed processing.

The majority of data generated from IoT and social media originates as a continuous stream, such as series of events from a social media network, time series generated from sensors, financial transactions, etc. The stream processing paradigm refers to the processing of data streaming that is continuous and possibly unbounded. Combining both graphs and streams leads to an interesting and rather challenging domain of streaming graph analytics. Graph streams refer to data that is modelled as a stream of edges or vertices with adjacency lists representing relations between entities of continuously evolving data generated by a single or multiple data sources. Streaming graph analytics is an emerging research field with great potential due to its capabilities of processing large graph streams with limited amounts of memory and low latency. 

In this dissertation, we present graph partitioning techniques for scalable streaming graph and time series analysis. First, we present and evaluate the use of data partitioning to enable data parallelism in order to address the challenge of scale in large spatial time series forecasting. We propose a graph partitioning technique for large scale spatial time series forecasting of road traffic as a use-case. Our experimental results on traffic density prediction for real-world sensor dataset using Long Short-Term Memory Neural Networks show that the partitioning-based models take 12x lower training time when run in parallel compared to the unpartitioned model of the entire road infrastructure. Furthermore, the partitioning-based models have 2x lower prediction error (RMSE) compared to the entire road model. Second, we showcase the practical usefulness of streaming graph analytics for large spatial time series analysis with the real-world task of traffic jam detection and reduction. We propose to apply streaming graph analytics by performing useful analytics on traffic data stream at scale with high throughput and low latency. Third, we study, evaluate, and compare the existing state-of-the-art streaming graph partitioning algorithms. We propose a uniform analysis framework built using Apache Flink to evaluate and compare partitioning features and characteristics of streaming graph partitioning methods. Finally, we present GCNSplit, a novel ML-driven streaming graph partitioning solution, that uses a small and constant in-memory state (bounded state) to partition (possibly unbounded) graph streams. Our results demonstrate that \ours provides high-throughput partitioning and can leverage data parallelism to sustain input rates of 100K edges/s. GCNSplit exhibits a partitioning quality, in terms of graph cuts and load balance, that matches that of the state-of-the-art HDRF (High Degree Replicated First) algorithm while storing three orders of magnitude smaller partitioning state.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2021. , p. 139
Series
TRITA-EECS-AVL ; 2021:55
Keywords [en]
Stream processing, graph processing, time series, big data, machine learning
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-301596ISBN: 978-91-7873-979-0 (print)OAI: oai:DiVA.org:kth-301596DiVA, id: diva2:1592329
Public defence
2021-10-04, https://kth-se.zoom.us/meeting/register/u5Uof-2gqzsvGNdRddTW4DKuhwT5hwiqKPUo, Sal C Electrum, Kistagången 16, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20210909

Available from: 2021-09-09 Created: 2021-09-08 Last updated: 2023-03-06Bibliographically approved

Open Access in DiVA

fulltext(29252 kB)1238 downloads
File information
File name FULLTEXT01.pdfFile size 29252 kBChecksum SHA-512
12a3bdc939c49ddbde46ac15033763472cc0ef45e63ae08395cf54e4f52905b7572a4f9b31a0772cd701fb64e7ea045d2dc66ae82a9a702d08a2f84b6a8d8930
Type fulltextMimetype application/pdf

Authority records

Abbas, Zainab

Search in DiVA

By author/editor
Abbas, Zainab
By organisation
Software and Computer systems, SCS
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 1239 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 2461 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf