Horchidan, Sonia-Florina (ORCID iD: orcid.org/0000-0002-8573-0090)
Publications (6 of 6)
Horchidan, S.-F., Chen, P. H., Kritharakis, E., Carbone, P. & Kalavri, V. (2024). Crayfish: Navigating the Labyrinth of Machine Learning Inference in Stream Processing Systems. In: Advances in Database Technology - EDBT. Paper presented at the 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, Mar 25 2024 - Mar 28 2024 (pp. 676-689). Open Proceedings.org, 27, Article ID 3.
Crayfish: Navigating the Labyrinth of Machine Learning Inference in Stream Processing Systems
2024 (English). In: Advances in Database Technology - EDBT, Open Proceedings.org, 2024, Vol. 27, p. 676-689, article id 3. Conference paper, Published paper (Refereed)
Abstract [en]

As Machine Learning predictions are increasingly being used in business analytics pipelines, integrating stream processing with model serving has become a common data engineering task. Despite their synergies, separate software stacks typically handle streaming analytics and model serving. Systems for data stream management do not support ML inference out-of-the-box, while model-serving frameworks have limited functionality for continuous data transformations, windowing, and other streaming tasks. As a result, developers are left with a design space dilemma whose trade-offs are not well understood. This paper presents Crayfish, an extensible benchmarking framework that facilitates designing and executing comprehensive evaluation studies of streaming inference pipelines. We demonstrate the capabilities of Crayfish by studying four data processing systems, three embedded libraries, three external serving frameworks, and two pre-trained models. Our results prove the necessity of a standardized benchmarking framework and show that (1) even for serving tools in the same category, the performance can vary greatly and, sometimes, defy intuition, (2) GPU accelerators can show compelling improvements for the serving task, but the improvement varies across tools, and (3) serving alternatives can achieve significantly different performance, depending on the stream processors they are integrated with.
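A streaming-inference benchmark of this kind essentially drives records through a pluggable serving backend and records throughput and latency. The sketch below is a minimal illustration of that measurement loop, assuming a trivial stand-in model; all function names here are hypothetical, not Crayfish's actual API.

```python
import time
from statistics import mean

def benchmark(serve, records):
    """Drive records through a serving callable, timing each call.

    `serve` stands in for any inference backend (an embedded library
    call or an RPC to an external server); illustrative only.
    """
    latencies = []
    start = time.perf_counter()
    for rec in records:
        t0 = time.perf_counter()
        serve(rec)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_rps": len(records) / elapsed,
        "mean_latency_s": mean(latencies),
        "p99_latency_s": sorted(latencies)[int(0.99 * (len(latencies) - 1))],
    }

# Trivial stand-in "model": classify a number by its sign.
stats = benchmark(lambda x: 1 if x > 0 else 0, list(range(-500, 500)))
```

Swapping the `serve` callable is what lets one harness compare embedded libraries, external serving frameworks, and hardware accelerators under identical load.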

Place, publisher, year, edition, pages
Open Proceedings.org, 2024
Series
Advances in Database Technology - EDBT, ISSN 2367-2005 ; 27
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-346149 (URN), 10.48786/edbt.2024.58 (DOI), 2-s2.0-85190993856 (Scopus ID)
Conference
27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, Mar 25 2024 - Mar 28 2024
Note

QC 20240507

Part of ISBN:

978-389318091-2, 978-389318094-3, 978-389318095-0

Available from: 2024-05-03. Created: 2024-05-03. Last updated: 2024-05-07. Bibliographically approved.
Horchidan, S. & Carbone, P. (2023). ORB: Empowering Graph Queries through Inference. In: ESWC-JP 2023: Joint Proceedings of the ESWC 2023 Workshops and Tutorials, co-located with 20th European Semantic Web Conference, ESWC 2023. Paper presented at the Joint Workshops and Tutorials of the 20th European Semantic Web Conference, ESWC-JP 2023, Hersonissos, Greece, May 28 2023 - May 29 2023. CEUR-WS.
ORB: Empowering Graph Queries through Inference
2023 (English). In: ESWC-JP 2023: Joint Proceedings of the ESWC 2023 Workshops and Tutorials, co-located with 20th European Semantic Web Conference, ESWC 2023, CEUR-WS, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Executing queries on incomplete, sparse knowledge graphs yields incomplete results, especially when it comes to queries involving traversals. In this paper, we question the applicability of all known architectures for incomplete knowledge bases and propose ORB: a clear departure from existing system designs, relying on Machine Learning-based operators to provide inferred query results. At the same time, ORB addresses peculiarities inherent to knowledge graphs, such as schema evolution, dynamism, scalability, as well as high query complexity via the use of embedding-driven inference. Through ORB, we stress that approximating complex processing tasks is not only desirable but also imperative for knowledge graphs.

Place, publisher, year, edition, pages
CEUR-WS, 2023
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-336774 (URN), 2-s2.0-85168679093 (Scopus ID)
Conference
Joint Workshops and Tutorials of the 20th European Semantic Web Conference, ESWC-JP 2023, Hersonissos, Greece, May 28 2023 - May 29 2023
Note

QC 20230920

Available from: 2023-09-20. Created: 2023-09-20. Last updated: 2023-09-20. Bibliographically approved.
Horchidan, S. (2023). Query Optimization for Inference-Based Graph Databases. In: VLDB-PhD 2023 - Proceedings of the VLDB 2023 PhD Workshop, co-located with the 49th International Conference on Very Large Data Bases, VLDB 2023. Paper presented at the 49th International Conference on Very Large Data Bases PhD Workshop, VLDB-PhD Workshop 2023, Vancouver, Canada, Aug 28 2023 (pp. 33-36). CEUR-WS.
Query Optimization for Inference-Based Graph Databases
2023 (English). In: VLDB-PhD 2023 - Proceedings of the VLDB 2023 PhD Workshop, co-located with the 49th International Conference on Very Large Data Bases, VLDB 2023, CEUR-WS, 2023, p. 33-36. Conference paper, Published paper (Refereed)
Abstract [en]

Knowledge Graphs are commonly characterized by two challenges: massive scale and sparsity. The former leads to slow response times for complex queries with random data accesses, especially when they require deep graph traversals. The latter, caused by missing connections and characteristics in graphs modeling real information, implies that any analysis based solely on explicitly stored data is bound to yield incomplete results. This work aims to develop a novel graph database architecture that leverages the power of Graph Machine Learning to equip graph queries with prediction capabilities while offering approximate but timely results to complex queries. We discuss the challenges, design decisions, and research avenues involved in materializing this prototype, alongside an outline of the actively pursued research plan.
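An inference-equipped traversal operator of the kind this work envisions might look roughly as follows; the dot-product score, the threshold, and all names are illustrative stand-ins for a trained graph-embedding model, not the proposed system's design.

```python
def inferred_neighbors(node, stored_edges, embed, candidates, threshold=0.8):
    """Return stored neighbors plus neighbors inferred from embeddings.

    A sketch of a prediction-capable neighbor lookup: edges missing
    from storage are recovered when embedding similarity is high.
    """
    result = set(stored_edges.get(node, ()))
    for cand in candidates:
        if cand == node or cand in result:
            continue
        # Embedding similarity as a plausibility score for the missing edge.
        score = sum(a * b for a, b in zip(embed[node], embed[cand]))
        if score >= threshold:
            result.add(cand)
    return result

embed = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
stored = {"a": {"c"}}                 # the edge (a, b) is missing from storage
nbrs = inferred_neighbors("a", stored, embed, ["a", "b", "c"])
```

A traversal built on such an operator yields approximate but more complete results, which is exactly the trade-off the abstract argues for.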

Place, publisher, year, edition, pages
CEUR-WS, 2023
Keywords
Graph Databases, Graph Representation Learning, Query Optimization, Uncertainty Estimation
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-336736 (URN), 2-s2.0-85169435702 (Scopus ID)
Conference
49th International Conference on Very Large Data Bases PhD Workshop, VLDB-PhD Workshop 2023, Vancouver, Canada, Aug 28 2023
Note

QC 20230919

Available from: 2023-09-19. Created: 2023-09-19. Last updated: 2023-09-19. Bibliographically approved.
Horchidan, S.-F., Kritharakis, E., Kalavri, V. & Carbone, P. (2022). Evaluating model serving strategies over streaming data. In: Proceedings of the 6th Workshop on Data Management for End-To-End Machine Learning, DEEM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference. Paper presented at the 6th Workshop on Data Management for End-To-End Machine Learning, DEEM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, 12 June 2022, Virtual, Online. Association for Computing Machinery (ACM), Article ID 4.
Evaluating model serving strategies over streaming data
2022 (English). In: Proceedings of the 6th Workshop on Data Management for End-To-End Machine Learning, DEEM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, Association for Computing Machinery (ACM), 2022, article id 4. Conference paper, Published paper (Refereed)
Abstract [en]

We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that use embedded libraries to serve pre-trained models. Latency, however, can vary across strategies: external serving can even achieve lower latency when network conditions are optimal, owing to more specialized use of the underlying hardware. We discuss our findings and provide further motivating arguments towards future research on ML-native data streaming engines.
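The embedded-versus-external distinction evaluated here can be sketched in a few lines; the fixed round-trip delay and all names below are illustrative stand-ins, not the paper's actual Flink pipelines.

```python
import time

def embedded_pipeline(model, stream):
    # Embedded serving: the model runs inside the stream operator,
    # so each event pays only the in-process inference cost.
    return [model(x) for x in stream]

def make_rpc(model, round_trip_s=0.0001):
    # External serving: each event crosses a process/network boundary,
    # simulated here by a fixed round-trip delay per call.
    def call(x):
        time.sleep(round_trip_s)
        return model(x)
    return call

def external_pipeline(rpc_call, stream):
    return [rpc_call(x) for x in stream]

model = lambda x: x * 2          # stand-in for a pre-trained classifier
stream = list(range(10))
same = embedded_pipeline(model, stream) == external_pipeline(make_rpc(model), stream)
```

Both strategies compute identical results; what the study measures is how the per-event boundary crossing (and, conversely, the external server's specialized hardware use) shifts throughput and latency.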

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2022
Keywords
data streams, machine learning inference
National Category
Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-317104 (URN), 10.1145/3533028.3533308 (DOI), 2-s2.0-85133190958 (Scopus ID)
Conference
6th Workshop on Data Management for End-To-End Machine Learning, DEEM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, 12 June 2022, Virtual, Online
Note

QC 20220906

Part of proceedings: ISBN 978-145039375-1

Available from: 2022-09-06. Created: 2022-09-06. Last updated: 2022-09-06. Bibliographically approved.
Zwolak, M., Abbas, Z., Horchidan, S.-F., Carbone, P. & Kalavri, V. (2022). GCNSplit: Bounding the State of Streaming Graph Partitioning. In: Proceedings of the 5th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference. Paper presented at the 5th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, 17 June 2022, Philadelphia, USA. Association for Computing Machinery, Inc, Article ID 3.
GCNSplit: Bounding the State of Streaming Graph Partitioning
2022 (English). In: Proceedings of the 5th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, Association for Computing Machinery, Inc, 2022, article id 3. Conference paper, Published paper (Refereed)
Abstract [en]

This paper introduces GCNSplit, a streaming graph partitioning framework capable of handling unbounded streams with bounded state requirements. We frame partitioning as a classification problem and we employ an unsupervised model whose loss function minimizes edge-cuts. GCNSplit leverages an inductive graph convolutional network (GCN) to embed graph characteristics into a low-dimensional space and assign edges to partitions in an online manner. We evaluate GCNSplit with real-world graph datasets of various sizes and domains. Our results demonstrate that GCNSplit provides high-throughput, top-quality partitioning, and successfully leverages data parallelism. It achieves a throughput of 430K edges/s on a real-world graph of 1.6B edges using a bounded 147KB-sized model, contrary to the state-of-the-art HDRF algorithm that requires > 116GB in-memory state. With a well-balanced normalized load of 1.01, GCNSplit achieves a replication factor on par with HDRF, showcasing high partitioning quality while storing three orders of magnitude smaller partitioning state. Owing to the power of GCNs, we show that GCNSplit can generalize to entirely unseen graphs while outperforming the state-of-the-art stream partitioners in some cases.
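The online assignment loop that a streaming edge partitioner runs can be sketched as below. GCNSplit scores candidate partitions with a trained, bounded-size GCN; in this sketch a hand-written locality score (and unbounded per-partition vertex sets, unlike GCNSplit's constant-size model) stands in, purely to illustrate the single-pass interface.

```python
from collections import Counter

def stream_partition(edges, k, epsilon=1.05):
    """Assign a stream of edges to k partitions in one online pass.

    Illustrative stand-in for a learned partitioner: prefer partitions
    that already contain an endpoint (fewer cut edges), subject to a
    load-balance constraint.
    """
    loads = Counter()
    seen = [set() for _ in range(k)]          # vertices observed per partition
    total = len(edges)
    max_load = epsilon * total / k            # balance constraint
    assignment = []
    for u, v in edges:
        best, best_score = None, None
        for p in range(k):
            if loads[p] >= max_load:          # partition full, skip it
                continue
            # Reward locality of u and v, penalize heavy partitions.
            score = (u in seen[p]) + (v in seen[p]) - loads[p] / total
            if best_score is None or score > best_score:
                best, best_score = p, score
        loads[best] += 1
        seen[best].update((u, v))
        assignment.append(best)
    return assignment, loads

# Two disjoint triangles: locality keeps each triangle in one partition.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
assignment, loads = stream_partition(edges, k=2)
```

Replacing the hand-written score with an inductive GCN over node features is what lets GCNSplit generalize to unseen graphs while keeping its state bounded by the model size.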

Place, publisher, year, edition, pages
Association for Computing Machinery, Inc, 2022
Keywords
data streams, graph neural networks, graph partitioning
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-317564 (URN), 10.1145/3533702.3534920 (DOI), 2-s2.0-85137089721 (Scopus ID)
Conference
5th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference, 17 June 2022, Philadelphia, USA
Note

QC 20220914

Part of proceedings: ISBN 978-145039377-5

Available from: 2022-09-14. Created: 2022-09-14. Last updated: 2022-09-14. Bibliographically approved.
Imtiaz, S., Horchidan, S.-F., Abbas, Z., Arsalan, M., Chaudhry, H. N. & Vlassov, V. (2020). Privacy Preserving Time-Series Forecasting of User Health Data Streams. In: 2020 IEEE International Conference on Big Data (Big Data). Paper presented at the 2020 IEEE International Conference on Big Data (Big Data) (pp. 3428-3437). Institute of Electrical and Electronics Engineers (IEEE).
Privacy Preserving Time-Series Forecasting of User Health Data Streams
2020 (English). In: 2020 IEEE International Conference on Big Data (Big Data), Institute of Electrical and Electronics Engineers (IEEE), 2020, p. 3428-3437. Conference paper, Published paper (Refereed)
Abstract [en]

Privacy preservation plays a vital role in health care applications, where privacy requirements are particularly strict. With the rapid increase in the amount, quality, and detail of health data gathered by smart devices, new mechanisms are required that can cope with large-scale and real-time processing requirements. Federated learning (FL) is one of the conventional approaches that facilitate the training of AI models without access to the raw data. However, recent studies have shown that FL alone does not guarantee sufficient privacy. Differential privacy (DP) is a well-known approach for privacy guarantees; however, because of the noise addition, DP must trade off privacy against accuracy. In this work, we design and implement an end-to-end pipeline using DP and FL for the first time in the context of health data streams. We propose a clustering mechanism that leverages the similarities between users to improve prediction accuracy and significantly reduce model training time. Depending on the dataset and features, our predictions are no more than 0.025% off the ground-truth value, relative to the value range. Moreover, our clustering mechanism brings a significant reduction in training time, with up to 49% reduction in prediction error in the best case, as compared to training a single model on the entire dataset. Our proposed privacy preserving mechanism at most introduces a decrease of ≈ 2% in the prediction accuracy of the trained models. Furthermore, our proposed clustering mechanism reduces the prediction error even in highly noisy settings, by as much as 38% compared to using a single federated private model.
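The noise-addition step behind the privacy/accuracy trade-off mentioned above can be sketched as a clip-and-noise transform applied to each client's model update before sharing. This is a minimal illustration of the standard DP mechanism in federated settings; the clipping norm and noise multiplier below are illustrative values, not the paper's configuration.

```python
import random

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a client's model update and add Gaussian noise before sharing.

    Clipping bounds any single client's influence; the noise scale is
    tied to that bound, which is what yields the privacy guarantee.
    """
    rng = rng or random.Random(0)          # seeded here for reproducibility
    norm = sum(w * w for w in update) ** 0.5
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [w * scale for w in update]  # L2 norm now at most clip_norm
    sigma = noise_multiplier * clip_norm
    return [w + rng.gauss(0.0, sigma) for w in clipped]

update = [3.0, 4.0]                        # L2 norm 5.0, clipped down to 1.0
noisy = privatize_update(update)
```

A larger `noise_multiplier` strengthens privacy but degrades accuracy, which is the trade-off the paper's clustering mechanism helps offset.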

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2020
Keywords
Federated Learning, Differential Privacy, Streaming k-means, Generative Adversarial Networks
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-295068 (URN), 10.1109/BigData50022.2020.9378186 (DOI), 000662554703071 (), 2-s2.0-85103842271 (Scopus ID)
Conference
2020 IEEE International Conference on Big Data (Big Data)
Note

QC 20210602

Available from: 2021-05-18. Created: 2021-05-18. Last updated: 2023-03-06. Bibliographically approved.