Publications (9 of 9)
Carbone, P. (2018). Scalable and Reliable Data Stream Processing. (Doctoral dissertation). KTH Royal Institute of Technology
Scalable and Reliable Data Stream Processing
2018 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

Data-stream management systems have long been considered a promising architecture for fast data management. The stream processing paradigm offers an attractive means of declaring persistent application logic coupled with state over evolving data. However, despite contributions in programming semantics addressing certain aspects of data streaming, existing approaches have lacked a clear, universal specification for the underlying system execution. We investigate data stream processing as a general-purpose scalable computing architecture that can support continuous and iterative state-driven workloads. Furthermore, we examine how this architecture can enable the composition of reliable, reconfigurable services and complex applications that go beyond the needs of scalable data analytics, a major trend of the past decade.

In this dissertation, we specify a set of core components and mechanisms to compose reliable data stream processing systems while adopting three crucial design principles: blocking-coordination avoidance, programming-model transparency, and compositionality. Furthermore, we identify the core open challenges among the academic and industrial state of the art and provide a complete solution using these design principles as a guide. Our contributions address the following problems: I) Reliable Execution and Stream State Management, II) Computation Sharing and Semantics for Stream Windows, and III) Iterative Data Streaming. Several parts of this work have been integrated into Apache Flink, a widely-used, open-source scalable computing framework, and supported the deployment of hundreds of long-running large-scale production pipelines worldwide.

Abstract [sv] (translated)

Systems for stream data processing have long been considered a promising architecture for fast data management. The stream processing paradigm offers an attractive way to express stateful, persistent application logic over evolving data. Yet despite many contributions in programming semantics addressing certain aspects of data streaming, existing approaches have lacked a clear, universal specification for the underlying system execution. We investigate stream processing systems as a general-purpose scalable computing architecture for continuous and iterative applications. We further examine how this architecture can enable the composition of reliable, reconfigurable services and complex applications that go beyond the needs of the currently trending big-data analytics.

In this dissertation we specify a set of core components and mechanisms for composing reliable stream processing systems, while adopting three crucial design principles: avoidance of blocking coordination, programming-model transparency, and compositionality. We further identify the main open challenges in academia and industry in the area, and propose a complete solution guided by the principles above. Our contributions address the following problems: I) reliable execution and state management for data streams, II) computation sharing and semantics for stream windows, and III) iterative data streams. Several parts of this work have been integrated into Apache Flink, a widely used and well-known computing framework, and have been used in hundreds of large-scale production systems worldwide.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2018. p. 180
Series
TRITA-EECS-AVL ; 2018:54
Keywords
distributed systems, stream processing, data management, databases, distributed computing, data processing, fault tolerance, database optimisation, programming systems, data science, data analytics, computer science
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-233527 (URN)978-91-7729-901-1 (ISBN)
Public defence
2018-09-28, Sal A, Electrum, Kistagången 16, Kista, Stockholm, 09:00 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research
Note

QC 20180823

Available from: 2018-08-23 Created: 2018-08-22 Last updated: 2018-08-23. Bibliographically approved.
Kroll, L., Carbone, P. & Haridi, S. (2017). Kompics Scala: Narrowing the gap between algorithmic specification and executable code (short paper). In: Proceedings of the 8th ACM SIGPLAN International Symposium on Scala. Paper presented at ACM SIGPLAN International Symposium on Scala (pp. 73-77). ACM Digital Library
Kompics Scala: Narrowing the gap between algorithmic specification and executable code (short paper)
2017 (English) In: Proceedings of the 8th ACM SIGPLAN International Symposium on Scala, ACM Digital Library, 2017, p. 73-77. Conference paper, Published paper (Refereed)
Abstract [en]

Message-based programming frameworks facilitate the development and execution of core distributed computing algorithms today. Their twofold aim is to expose a programming model that minimises logical errors incurred during translation from an algorithmic specification to an executable program, and to provide an efficient runtime for event pattern-matching and scheduling of distributed components. Kompics Scala is a framework that allows for a direct, streamlined translation from a formal algorithm specification to practical code by reducing the cognitive gap between the two representations. Furthermore, its runtime decouples event pattern-matching from component execution logic, yielding clean, predictable behaviours. Our evaluation shows a low, constant performance overhead for Kompics Scala compared to similar frameworks that otherwise fail to offer the same level of model clarity.
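The decoupling of event pattern-matching from component execution can be illustrated with a toy sketch (all names below are illustrative, not the Kompics Scala API): handlers are registered per event type, events queue in a mailbox, and events with no matching handler are simply dropped rather than blocking the component.

```python
# Minimal message-passing component sketch: matching is separated from
# execution, and the mailbox serialises handler invocations.
from collections import deque

class Component:
    def __init__(self):
        self.handlers = {}      # event type -> handler
        self.mailbox = deque()  # pending events, executed one at a time

    def on(self, event_type, handler):
        self.handlers[event_type] = handler

    def trigger(self, event):
        self.mailbox.append(event)

    def run(self):
        # An event with no matching handler is dropped, never blocking.
        while self.mailbox:
            event = self.mailbox.popleft()
            handler = self.handlers.get(type(event))
            if handler is not None:
                handler(event)

class Ping: pass
class Pong: pass

pings = []
c = Component()
c.on(Ping, lambda e: pings.append(e))
c.trigger(Ping())
c.trigger(Pong())   # no handler registered: ignored
c.run()
print(len(pings))   # -> 1
```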

Place, publisher, year, edition, pages
ACM Digital Library, 2017
Keywords
component model, message-passing, distributed systems architecture
National Category
Information Systems
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-218781 (URN)10.1145/3136000.3136009 (DOI)2-s2.0-85037137982 (Scopus ID)978-1-4503-5529-2 (ISBN)
Conference
ACM SIGPLAN International Symposium on Scala
Note

QC 20180111

Available from: 2017-11-30 Created: 2017-11-30 Last updated: 2018-01-13. Bibliographically approved.
Carbone, P., Gévay, G. E., Hermann, G., Katsifodimos, A., Soto, J., Markl, V. & Haridi, S. (2017). Large-scale data stream processing systems. In: Handbook of Big Data Technologies (pp. 219-260). Springer International Publishing
Large-scale data stream processing systems
2017 (English) In: Handbook of Big Data Technologies, Springer International Publishing, 2017, p. 219-260. Chapter in book (Other academic)
Abstract [en]

In our data-centric society, online services, decision making, and many other aspects of daily life are becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have evolved substantially in terms of both expressive programming-model support and efficient, durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and the computations applied to them, while hiding details of how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models, allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large-scale data stream processing systems, covering programming-model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of the disruptive applications that large-scale data streaming enables, from a systemic point of view.
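The separation between declaring a computation and executing it can be sketched in a few lines (the `Stream` class below is invented for illustration; it is not any engine's API): the pipeline states what to compute over an unbounded source, while the evaluation strategy stays hidden behind the abstraction, and only prefixes of the infinite stream are ever observed.

```python
# Toy declarative pipeline over an unbounded source: transformations are
# declared lazily; the "engine" here is plain single-process evaluation.
import itertools

class Stream:
    def __init__(self, source):
        self.source = source

    def map(self, fn):
        return Stream(fn(x) for x in self.source)

    def filter(self, pred):
        return Stream(x for x in self.source if pred(x))

    def take(self, n):
        # Unbounded streams are never fully collected; observe a prefix.
        out = []
        for x in self.source:
            out.append(x)
            if len(out) == n:
                break
        return out

sensor = Stream(itertools.count())                      # unbounded source
readings = sensor.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(readings.take(3))  # -> [0, 20, 40]
```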

Place, publisher, year, edition, pages
Springer International Publishing, 2017
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-216552 (URN)10.1007/978-3-319-49340-4_7 (DOI)2-s2.0-85019960984 (Scopus ID)9783319493404 (ISBN)9783319493398 (ISBN)
Note

QC 20171108

Available from: 2017-11-08 Created: 2017-11-08 Last updated: 2017-11-08. Bibliographically approved.
Carbone, P., Ewen, S., Fora, G., Haridi, S., Richter, S. & Tzoumas, K. (2017). State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. Proceedings of the VLDB Endowment, 10(12), 1718-1729
State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing
2017 (English) In: Proceedings of the VLDB Endowment, ISSN 2150-8097, E-ISSN 2150-8097, Vol. 10, no 12, p. 1718-1729. Article in journal (Refereed), Published
Abstract [en]

Stream processors are emerging in industry as an apparatus that drives analytical but also mission-critical services handling the core of persistent application logic. Thus, apart from scalability and low latency, a rising system need is first-class support for application state together with strong consistency guarantees and adaptivity to cluster reconfigurations, software patches, and partial failures. Although prior systems research has addressed some of these specific problems, the practical challenge lies in how such guarantees can be materialized in a transparent, non-intrusive manner that relieves the user from unnecessary constraints. Such needs served as the main design principles of state management in Apache Flink, an open-source, scalable stream processor. We present Flink's core pipelined, in-flight mechanism, which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution. Consistent snapshots cover all needs for system reconfiguration, fault tolerance, and version management through coarse-grained rollback recovery. Application state is declared explicitly to the system, allowing efficient partitioning and transparent commits to persistent storage. We further present Flink's backend implementations and mechanisms for high availability, external state queries, and output commit. Finally, we demonstrate how these mechanisms behave in practice with metrics and large-deployment insights exhibiting the low performance trade-offs of our approach and the general benefits of exploiting asynchrony in continuous, yet sustainable, system deployments.
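The idea of explicitly declared state with coarse-grained rollback recovery can be sketched as follows (a toy model for illustration, not Flink's actual state API): because state is declared to the system rather than hidden inside user code, the runtime can snapshot it per epoch and roll the whole application back to a consistent snapshot on failure or reconfiguration.

```python
# Toy keyed-state store with epoch snapshots and rollback recovery.
import copy

class KeyedState:
    def __init__(self):
        self.store = {}      # declared application state, keyed
        self.snapshots = {}  # epoch -> consistent copy of the state

    def update(self, key, fn, default=0):
        self.store[key] = fn(self.store.get(key, default))

    def snapshot(self, epoch):
        # Coarse-grained: persist the whole state for this epoch.
        self.snapshots[epoch] = copy.deepcopy(self.store)

    def rollback(self, epoch):
        self.store = copy.deepcopy(self.snapshots[epoch])

state = KeyedState()
state.update("clicks", lambda v: v + 3)
state.snapshot(epoch=1)
state.update("clicks", lambda v: v + 5)   # progress past the snapshot
state.rollback(epoch=1)                   # recover: later work is discarded
print(state.store["clicks"])              # -> 3
```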

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-220296 (URN)000416494000011 ()2-s2.0-85036646347 (Scopus ID)
Note

QC 20171222

Available from: 2017-12-22 Created: 2017-12-22 Last updated: 2018-01-13. Bibliographically approved.
Carbone, P., Traub, J., Katsifodimos, A., Haridi, S. & Markl, V. (2016). Cutty: Aggregate Sharing for User-Defined Windows. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management. Paper presented at 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, United States, 24 October 2016 through 28 October 2016 (pp. 1201-1210). Association for Computing Machinery (ACM)
Cutty: Aggregate Sharing for User-Defined Windows
2016 (English) In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, Association for Computing Machinery (ACM), 2016, p. 1201-1210. Conference paper, Published paper (Refereed)
Abstract [en]

Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.

In this paper we present a technique for efficient aggregate sharing across data stream windows that are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open-source stream processing system, and performed experiments demonstrating orders-of-magnitude reductions in aggregation costs compared to the state of the art.
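The aggregate-sharing idea can be illustrated with a minimal sketch (function and parameter names are invented for illustration, not Cutty's interfaces): cut the stream once at every window boundary, keep one partial aggregate per resulting slice, and let each (possibly overlapping) window combine slice partials instead of re-scanning raw records.

```python
# One pass over the records builds per-slice partial sums; every window
# result is then a cheap combination of the slices it spans.
def shared_sum(records, cut_points, windows):
    """records: list of ints; cut_points: sorted slice boundaries (indices);
    windows: list of (start_slice, end_slice) pairs over slice indices."""
    bounds = cut_points + [len(records)]
    slices, start = [], 0
    for end in bounds:
        slices.append(sum(records[start:end]))  # partial aggregate per slice
        start = end
    # Overlapping windows share the partials rather than the raw records.
    return [sum(slices[s:e]) for s, e in windows]

records = [1, 2, 3, 4, 5, 6]
# slices: [1+2, 3+4, 5+6] = [3, 7, 11]
print(shared_sum(records, cut_points=[2, 4],
                 windows=[(0, 2), (1, 3), (0, 3)]))  # -> [10, 18, 21]
```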

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keywords
Computer circuits, Computer programming, Data communication systems, Knowledge management, Open source software, Open systems, Semantics
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-198942 (URN)10.1145/2983323.2983807 (DOI)000390890800124 ()2-s2.0-84996567073 (Scopus ID)978-1-4503-4073-1 (ISBN)
Conference
25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, United States, 24 October 2016 through 28 October 2016
Note

QC 20170130

Available from: 2016-12-22 Created: 2016-12-22 Last updated: 2018-01-13. Bibliographically approved.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S. & Tzoumas, K. (2015). Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4)
Apache Flink: Stream and batch processing in a single engine
2015 (English) In: Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 36, no 4. Article in journal (Refereed), Published
Place, publisher, year, edition, pages
IEEE Computer Society, 2015
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-198940 (URN)
Note

QC 20161222

Available from: 2016-12-22 Created: 2016-12-22 Last updated: 2018-01-13. Bibliographically approved.
Carbone, P. & Vlassov, V. (2015). Auto-Scoring of Personalised News in the Real-Time Web: Challenges, Overview and Evaluation of the State-of-the-Art Solutions. Paper presented at Cloud and Autonomic Computing (ICCAC), 2015 International Conference on, Cambridge, MA, USA, September 21-25, 2015 (pp. 169-180). IEEE Computer Society
Auto-Scoring of Personalised News in the Real-Time Web: Challenges, Overview and Evaluation of the State-of-the-Art Solutions
2015 (English) Conference paper, Published paper (Refereed)
Abstract [en]

The problem of automated personalised news recommendation, often referred to as auto-scoring, has attracted substantial research throughout the last decade in multiple domains such as data mining and machine learning, computer systems, e-commerce, and sociology. A typical "recommender systems" approach to this problem usually adopts content-based scoring, collaborative filtering, or, more often, a hybrid approach. Due to their special nature, news articles introduce further challenges and constraints beyond conventional item recommendation problems, being characterised by short lifetimes and rapid popularity trends. In this survey, we provide an overview of the challenges and current solutions in news personalisation and ranking from both an algorithmic and a system design perspective, and present our evaluation of the most representative scoring algorithms while also exploring the benefits of a hybrid approach. Our evaluation is based on a real-life case study in news recommendations.
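A minimal sketch of the hybrid scoring idea the survey evaluates (the weights and the decay model below are illustrative assumptions, not taken from the paper): blend a content-based signal with a collaborative one, then damp the result by article age to model the short lifetime of news items.

```python
# Hybrid score = weighted blend of content and collaborative signals,
# multiplied by an exponential age decay (half-life in hours).
import math

def hybrid_score(content_sim, collab_score, age_hours,
                 alpha=0.6, half_life_hours=24.0):
    # An article loses half its score every half-life.
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return (alpha * content_sim + (1 - alpha) * collab_score) * decay

fresh = hybrid_score(content_sim=0.8, collab_score=0.5, age_hours=0)
stale = hybrid_score(content_sim=0.8, collab_score=0.5, age_hours=24)
print(round(fresh, 2), round(stale, 2))  # -> 0.68 0.34
```

The `alpha` weight trades off the two signals; tuning it (or learning it per user) is exactly the kind of design choice such hybrid systems must evaluate.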

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
Internet, collaborative filtering, recommender systems, auto-scoring, automated personalised news recommendation, content-based scoring, hybrid approach, item recommendation problems, news personalisation, real-time Web, algorithm design and analysis, collaboration, correlation, market research, measurement, data mining, machine learning, scoring algorithms
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-179469 (URN)10.1109/ICCAC.2015.9 (DOI)000380476500016 ()2-s2.0-84962109478 (Scopus ID)
Conference
Cloud and Autonomic Computing (ICCAC), 2015 International Conference on, Cambridge, MA, USA, September 21-25, 2015
Note

QC 20160121

Available from: 2015-12-17 Created: 2015-12-17 Last updated: 2016-09-05. Bibliographically approved.
Carbone, P., Fóra, G., Ewen, S., Haridi, S. & Tzoumas, K. (2015). Lightweight Asynchronous Snapshots for Distributed Dataflows.
Lightweight Asynchronous Snapshots for Distributed Dataflows
2015 (English) Report (Other academic)
Abstract [en]

Distributed stateful stream processing enables the deployment and execution of large-scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. These approaches suffer from two main drawbacks. First, they often stall the overall computation, which impacts ingestion. Second, they eagerly persist all records in transit along with the operator states, which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm, suited to modern dataflow execution engines, that minimises space requirements. ABS persists only operator states on acyclic execution topologies, while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on execution, maintaining linear scalability and performing well with frequent snapshots.
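The ABS alignment logic on an acyclic topology can be sketched roughly as follows (a simplified single-process model for illustration, not the actual Flink implementation): a barrier flows along with the records; an operator with several inputs buffers records arriving on channels whose barrier has already been seen, snapshots only its own state once barriers from all inputs have arrived, and then resumes the buffered records. No records in transit are persisted.

```python
# Simplified barrier alignment for one operator with multiple input channels.
BARRIER = object()

class Operator:
    def __init__(self, n_inputs):
        self.state = 0          # running sum: the only thing ever persisted
        self.seen = set()       # input channels whose barrier has arrived
        self.buffered = []      # records held back during alignment
        self.n_inputs = n_inputs
        self.snapshot = None

    def receive(self, channel, item):
        if item is BARRIER:
            self.seen.add(channel)
            if len(self.seen) == self.n_inputs:  # aligned: snapshot state only
                self.snapshot = self.state
                self.seen.clear()
                for rec in self.buffered:        # resume held-back records
                    self.state += rec
                self.buffered.clear()
        elif channel in self.seen:
            self.buffered.append(item)  # post-barrier record: hold until aligned
        else:
            self.state += item          # pre-barrier record: process normally

op = Operator(n_inputs=2)
op.receive(0, 5)
op.receive(0, BARRIER)
op.receive(0, 7)          # arrives after channel 0's barrier: buffered
op.receive(1, 2)
op.receive(1, BARRIER)    # all barriers in: snapshot taken, buffer drained
print(op.snapshot, op.state)  # -> 7 14
```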

Publisher
p. 8
Series
TRITA-ICT ; 2015:08
Keywords
fault tolerance, distributed computing, stream processing, dataflow, cloud computing, state management
National Category
Computer Systems
Research subject
Information and Communication Technology; Computer Science
Identifiers
urn:nbn:se:kth:diva-170185 (URN)978-91-7595-651-0 (ISBN)
Note

QC 20150630

Available from: 2015-06-28 Created: 2015-06-28 Last updated: 2015-06-30. Bibliographically approved.
Carbone, P., Vandikas, K. & Zaloshnja, F. (2013). Towards highly available complex event processing deployments in the cloud. In: International Conference on Next Generation Mobile Applications, Services, and Technologies. Paper presented at 7th International Conference on Next Generation Mobile Applications, Services, and Technologies, NGMAST 2013; Prague; Czech Republic; 25 September 2013 through 27 September 2013 (pp. 153-158). IEEE
Towards highly available complex event processing deployments in the cloud
2013 (English) In: International Conference on Next Generation Mobile Applications, Services, and Technologies, IEEE, 2013, p. 153-158. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in distributed computing have made it possible to achieve high availability on traditional systems and thus offer them as reliable services. For several offline computational applications, such as fine-grained batch processing, their parallel nature together with weak consistency requirements allowed a more straightforward transition. On the other hand, online processing systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, offering high expressiveness and vertical scalability at the expense of distribution. Despite attempts to design dedicated distributed CEP systems, there is potential for existing systems to benefit from a sustainable cloud deployment. In this work we address the main challenges of providing such a CEP service with a focus on reliability, since it is the most crucial aspect of that transition. Our approach targets low average detection latency and sustainability by leveraging event delegation mechanisms present in existing stream execution platforms. It also introduces redundancy and transactional logging to provide improved fault tolerance and partial recovery. Our performance analysis illustrates the benefits of our approach and shows acceptable performance costs for online CEP exhibited by the fault tolerance mechanisms we introduced.
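The transactional-logging idea can be sketched as a write-ahead log that a standby replica replays to rebuild detector state (the class, the event names, and the trivial two-event pattern below are hypothetical illustrations, not from the paper):

```python
# Each input event is appended to a durable log before detection, so a
# fresh replica can replay the surviving log for partial recovery.
class CepEngine:
    def __init__(self, log=None):
        self.log = log if log is not None else []
        self.matches = []
        self.window = []

    def process(self, event):
        self.log.append(event)   # log first (write-ahead), then detect
        self._detect(event)

    def _detect(self, event):
        # Trivial pattern: emit a match on every adjacent (login, failure).
        self.window.append(event)
        if self.window[-2:] == ["login", "failure"]:
            self.matches.append(tuple(self.window[-2:]))

    @classmethod
    def recover(cls, log):
        # Partial recovery: replay the log into a fresh replica's detector.
        replica = cls(log=list(log))
        for event in log:
            replica._detect(event)
        return replica

primary = CepEngine()
for e in ["login", "failure", "login"]:
    primary.process(e)
replica = CepEngine.recover(primary.log)   # primary fails; replica rebuilds
print(replica.matches)  # -> [('login', 'failure')]
```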

Place, publisher, year, edition, pages
IEEE, 2013
Series
International Conference on Next Generation Mobile Applications, Services, and Technologies, ISSN 2161-2889
Keywords
CDR, Complex event processing, Distributed stream processing, Fault tolerance, Fraud detection, SIP
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-143856 (URN)10.1109/NGMAST.2013.35 (DOI)2-s2.0-84892755705 (Scopus ID)978-147992019-8 (ISBN)
Conference
7th International Conference on Next Generation Mobile Applications, Services, and Technologies, NGMAST 2013; Prague; Czech Republic; 25 September 2013 through 27 September 2013
Note

QC 20140403

Available from: 2014-04-03 Created: 2014-03-31 Last updated: 2018-01-11. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-9351-8508
