Graph partitioning is an essential yet challenging task for massive graph analysis in distributed computing. Common graph partitioning methods scan the complete graph to obtain structural characteristics offline, before partitioning. However, the emerging need for low-latency, continuous graph analysis led to the development of online partitioning methods. Online methods ingest edges or vertices as a stream, making partitioning decisions on the fly based on partial knowledge of the graph. Prior studies have compared offline graph partitioning techniques across different systems. Yet, little effort has been put into investigating the characteristics of online graph partitioning strategies.
In this work, we describe and categorize online graph partitioning techniques based on their assumptions, objectives and costs. Furthermore, we employ an experimental comparison across different applications and datasets, using a unified distributed runtime based on Apache Flink. Our experimental results showcase that model-dependent online partitioning techniques such as low-cut algorithms offer better performance for communication-intensive applications such as bulk synchronous iterative algorithms, albeit higher partitioning costs. Otherwise, model-agnostic techniques trade off data locality for lower partitioning costs and balanced workloads which is beneficial when executing data-parallel single-pass graph algorithms.
Recent advances in reliable distributed computing have made it possible to provide high availability and scalability to traditional systems and thus serve them as reliable services. For some systems, their parallel nature in addition to weak consistency requirements allowed a more trivial transision such as distributed storage, online data analysis, batch processing and distributed stream processing. On the other hand, systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, being able to offer high expressiveness at the expense of low distribution. In this work, we address the main challenges of providing a highly-available Distributed CEP service with a focus on reliability, since it is the most crucial and untouched aspect of that transition. The experimental solution presented targets low average detection latency and leverages event delegation mechanisms present on existing stream execution platforms and in-memory logging to provide availability of any complex event processing abstraction on top via redundancy and partial recovery.
Data-stream management systems have for long been considered as a promising architecture for fast data management. The stream processing paradigm poses an attractive means of declaring persistent application logic coupled with state over evolving data. However, despite contributions in programming semantics addressing certain aspects of data streaming, existing approaches have been lacking a clear, universal specification for the underlying system execution. We investigate the case of data stream processing as a general-purpose scalable computing architecture that can support continuous and iterative state-driven workloads. Furthermore, we examine how this architecture can enable the composition of reliable, reconfigurable services and complex applications that go even beyond the needs of scalable data analytics, a major trend in the past decade.
In this dissertation, we specify a set of core components and mechanisms to compose reliable data stream processing systems while adopting three crucial design principles: blocking-coordination avoidance, programming-model transparency, and compositionality. Furthermore, we identify the core open challenges among the academic and industrial state of the art and provide a complete solution using these design principles as a guide. Our contributions address the following problems: I) Reliable Execution and Stream State Management, II) Computation Sharing and Semantics for Stream Windows, and III) Iterative Data Streaming. Several parts of this work have been integrated into Apache Flink, a widely-used, open-source scalable computing framework, and supported the deployment of hundreds of long-running large-scale production pipelines worldwide.
Stream processors are emerging in industry as an apparatus that drives analytical but also mission critical services handling the core of persistent application logic. Thus, apart from scalability and low-latency, a rising system need is first-class support for application state together with strong consistency guarantees, and adaptivity to cluster reconfigurations, software patches and partial failures. Although prior systems research has addressed some of these specific problems, the practical challenge lies on how such guarantees can be materialized in a transparent, non-intrusive manner that relieves the user from unnecessary constraints. Such needs served as the main design principles of state management in Apache Flink, an open source, scalable stream processor.
We present Flink’s core pipelined, in-flight mechanism which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution. Consistent snapshots cover all needs for system reconfiguration, fault tolerance and version management through coarse grained rollback recovery. Application state is declared explicitly to the system, allowing efficient partitioning and transparent commits to persistent storage. We further present Flink’s backend implementations and mechanisms for high availability, external state queries and output commit. Finally, we demonstrate how these mechanisms behave in practice with metrics and large deployment insights exhibiting the low performance trade-offs of our approach and the general benefits of exploiting asynchrony in continuous, yet sustainable system deployments.
Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation which impacts ingestion. Second, they eagerly persist all records in transit along with the operation states which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots.
In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have further evolved substantially both in terms of expressive programming model support and also efficient and durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and applied computations, while hiding details on how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models further allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of disruptive applications that large-scale data streaming enables from a systemic point of view.
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows were extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no work has been put in optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.
In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.
Recent advances in distributed computing have made it possible to achieve high availability on traditional systems and thus serve them as reliable services. For several offline computational applications, such as fine grained batch processing, their parallel nature in addition to weak consistency requirements allowed a more trivial transition. On the other hand, on-line processing systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, being able to offer high expressiveness and vertical scalability at the expense of low distribution. Despite attempts to design dedicated distributed CEP systems there is potential for existing systems to benefit from a sustainable cloud deployment. In this work we address the main challenges of providing such a CEP service with a focus on reliability, since it is the most crucial aspect of that transition. Our approach targets low average detection latency and sustain-ability by leveraging event delegation mechanisms present on existing stream execution platforms. It also introduces redundancy and transactional logging to provide improved fault tolerance and partial recovery. Our performance analysis illustrates the benefits of our approach and shows acceptable performance costs for on-line CEP exhibited by the fault tolerance mechanisms we introduced.
The problem of automated personalised news recommendation, often referred as auto-scoring has attracted substantial research throughout the last decade in multiple domains such as data mining and machine learning, computer systems, e commerce and sociology. A typical "recommender systems" approach to solving this problem usually adopts content-based scoring, collaborative filtering or more often a hybrid approach. Due to their special nature, news articles introduce further challenges and constraints to conventional item recommendation problems, characterised by short lifetime and rapid popularity trends. In this survey, we provide an overview of the challenges and current solutions in news personalisation and ranking from both an algorithmic and system design perspective, and present our evaluation of the most representative scoring algorithms while also exploring the benefits of using a hybrid approach. Our evaluation is based on a real-life case study in news recommendations.
Executing queries on incomplete, sparse knowledge graphs yields incomplete results, especially when it comes to queries involving traversals. In this paper, we question the applicability of all known architectures for incomplete knowledge bases and propose ORB: a clear departure from existing system designs, relying on Machine Learning-based operators to provide inferred query results. At the same time, ORB addresses peculiarities inherent to knowledge graphs, such as schema evolution, dynamism, scalability, as well as high query complexity via the use of embedding-driven inference. Through ORB, we stress that approximating complex processing tasks is not only desirable but also imperative for knowledge graphs.
We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that make use of embedded libraries to serve pre-trained models. Whereas, latency can vary across strategies, with external serving even achieving lower latency when network conditions are optimal due to better specialized use of underlying hardware. We discuss our findings and provide further motivating arguments towards research in the area of ML-native data streaming engines in the future.
Message-based programming frameworks facilitate the development and execution of core distributed computing algorithms today. Their twofold aim is to expose a programming model that minimises logical errors incurred during translation from an algorithmic specification to executable program, and also to provide an efficient runtime for event pattern-matching and scheduling of distributed components. Kompics Scala is a framework that allows for a direct, streamlined translation from a formal algorithm specification to practical code by reducing the cognitive gap between the two representations. Furthermore, its runtime decouples event pattern-matching and component execution logic yielding clean, thoroughly expected behaviours. Our evaluation shows low and constant performance overhead of Kompics Scala compared to similar frameworks that otherwise fail to offer the same level of model clarity.
Omni-Paxos is a system for state machine replication that is completely resilient to partial network partitions, a major source of service disruptions in recent years. Omni-Paxos achieves its resilience through a decoupled design that separates the execution and state of leader election from log replication. The leader election builds on the concept of quorum-connected servers, with the sole focus on connectivity. Additionally, by decoupling reconfiguration from log replication, Omni-Paxos provides flexible and parallel log migration that improves the performance and robustness of reconfiguration. Our evaluation showcases two benefits over state-of-the-art protocols: (1) guaranteed recovery in at most four election timeouts under extreme partial network partitions, and (2) up to 8x shorter reconfiguration periods with 46% less I/O at the leader.
Most of the world's cloud data service workloads are currently being backed by replicated state machines. Production-grade log replication protocols used for the job impose heavy data transfer duties on the primary server which need to disseminate the log commands to all the replica servers. UniCache proposes a principal solution to this problem using a learned replicated cache which enables commands to be sent over the network as compressed encodings. UniCache takes advantage of that each replica has access to a consistent prefix of the replicated log which allows them to build a uniform lookup cache used for compressing and decompressing commands consistently. UniCache achieves effective speedups, lowering the primary load in application workloads with a skewed data distribution. Our experimental studies showcase a low pre-processing overhead and the highest performance gains in cross-data center deployments over wide area networks.
Graph Neural Networks (GNNs) have recently achieved good performance in many predictive tasks involving graph-structured data. However, the majority of existing models consider static graphs only and do not support training on graph streams. While inductive representation learning can generate predictions for unseen vertices, these are only accurate if the learned graph structure and properties remain stable over time. In this paper, we study the problem of employing experience replay to enable continuous graph representation learning in the streaming setting. We propose two online training methods, Random-Based Rehearsal-RBR, and Priority-Based Rehearsal-PBR, which avoid retraining from scratch when changes occur. Our algorithms are the first streaming GNN models capable of scaling to million-edge graphs with low training latency and without compromising accuracy. We evaluate the accuracy and training performance of these experience replay methods on the node classification problem using real-world streaming graphs of various sizes and domains. Our results demonstrate that PBR and RBR achieve orders of magnitude faster training as compared to offline methods while providing high accuracy and resiliency to concept drift.
Stream processing can generate insights from big data in real time as it is being produced. This paper reports findings from a 2017 seminar on big stream processing, focusing on applications, systems, and languages.
Portals is a serverless, distributed programming model that blends the exactly-once processing guarantees of stateful dataflow streaming frameworks with the message-driven compositionality of actor frameworks. Decentralized applications in Portals can be built dynamically, scale on demand, and always satisfy strict atomic processing guarantees that are natively embedded in the framework’s principal elements of computation, known as atomic streams. In this paper, we describe the capabilities of Portals and demonstrate its use in supporting several popular existing distributed programming paradigms and use-cases. We further introduce all programming model invariants and the corresponding system methods used to satisfy them.
Modern data services need to meet application developers’ demands in terms of scalability and resilience, and also support privacy regulations such as the EU’s GDPR. We outline the main systems challenges of supporting data privacy regulations in the context of large-scale data services, and advocate for causal snapshot consistency to ensure application-level and privacy-level consistency. We present Pods, an extension to the dataflow model that allows external services to access snapshotted operator state directly, with built-in support for addressing the outlined privacy challenges, and summarize open questions and research directions.
Serverless applications spanning the cloud and edge require flexible programming frameworks for expressing compositions across the different levels of deployment. Another critical aspect for applications with state is failure resilience beyond the scope of a single dataflow graph that is the current standard in data streaming systems. This paper presents Portals, an interactive, stateful dataflow composition framework with strong end-to-end guarantees. Portals enables event-driven, resilient applications that span across dataflow graphs and serverless deployments. The demonstration exhibits three scenarios in our multi-dataflow streaming-based system: dynamically composing a stateful serverless application; an interactive cloud and edge serverless application; and a Portals browser playground. This work was partially funded by Digital Futures, the Swedish Foundation for Strategic Research under Grant No.: BD15-0006, as well as RISE AI.
This paper introduces GCNSplit, a streaming graph partitioning framework capable of handling unbounded streams with bounded state requirements. We frame partitioning as a classification problem and we employ an unsupervised model whose loss function minimizes edge-cuts. GCNSplit leverages an inductive graph convolutional network (GCN) to embed graph characteristics into a low-dimensional space and assign edges to partitions in an online manner. We evaluate GCNSplit with real-world graph datasets of various sizes and domains. Our results demonstrate that GCNSplit provides high-throughput, top-quality partitioning, and successfully leverages data parallelism. It achieves a throughput of 430K edges/s on a real-world graph of 1.6B edges using a bounded 147KB-sized model, contrary to the state-of-the-art HDRF algorithm that requires > 116GB in-memory state. With a well-balanced normalized load of 1.01, GCNSplit achieves a replication factor on par with HDRF, showcasing high partitioning quality while storing three orders of magnitude smaller partitioning state. Owing to the power of GCNs, we show that GCNSplit can generalize to entirely unseen graphs while outperforming the state-of-the-art stream partitioners in some cases.