Publications (10 of 14)
Ghasemirahni, H., Barbette, T., Katsikas, G. P., Farshin, A., Roozbeh, A., Girondi, M., . . . Kostic, D. (2022). Packet Order Matters! Improving Application Performance by Deliberately Delaying Packets. In: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022. Paper presented at 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), APR 04-06, 2022, Renton, WA (pp. 807-827). USENIX - The Advanced Computing Systems Association
Packet Order Matters! Improving Application Performance by Deliberately Delaying Packets
2022 (English). In: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, USENIX - The Advanced Computing Systems Association, 2022, p. 807-827. Conference paper, Published paper (Refereed)
Abstract [en]

Data centers increasingly deploy commodity servers with high-speed network interfaces to enable low-latency communication. However, achieving low latency at high data rates crucially depends on how the incoming traffic interacts with the system's caches. When packets that need to be processed in the same way are consecutive, i.e., exhibit high temporal and spatial locality, caches deliver great benefits.

In this paper, we systematically study the impact of temporal and spatial traffic locality on the performance of commodity servers equipped with high-speed network interfaces. Our results show that (i) the performance of a variety of widely deployed applications degrades substantially with even the slightest lack of traffic locality, and (ii) a traffic trace from our organization reveals poor traffic locality as networking protocols, drivers, and the underlying switching/routing fabric spread packets out in time (reducing locality). To address these issues, we built Reframer, a software solution that deliberately delays packets and reorders them to increase traffic locality. Despite introducing μs-scale delays of some packets, we show that Reframer increases the throughput of a network service chain by up to 84% and reduces the flow completion time of a web server by 11% while improving its throughput by 20%.
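
The following sketch is not the paper's Reframer implementation; it only illustrates the core idea under stated assumptions: hold arriving packets for a small, bounded time window and then release them grouped by flow, so that packets needing identical processing reach the application back to back. All types, names, and constants (struct pkt, flow_id, MAX_BUF, FLUSH_US) are hypothetical.

```c
/* Illustrative sketch (hypothetical names/constants, not the paper's
 * Reframer): buffer packets for a short window, then release them grouped
 * by flow so that packets needing identical processing are handed to the
 * application consecutively. */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define MAX_BUF  1024          /* packets held per flush window        */
#define FLUSH_US 30            /* deliberate extra delay, microseconds */

struct pkt { uint32_t flow_id; uint16_t len; uint8_t data[1500]; };

static struct pkt *buf[MAX_BUF];
static size_t      n_buf;
static uint64_t    window_start_ns;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Sort key: flow id, so same-flow packets end up adjacent. */
static int by_flow(const void *a, const void *b) {
    const struct pkt *pa = *(struct pkt *const *)a;
    const struct pkt *pb = *(struct pkt *const *)b;
    return (pa->flow_id > pb->flow_id) - (pa->flow_id < pb->flow_id);
}

/* Push one packet; flush when the buffer is full or the delay budget
 * for the oldest buffered packet has expired. */
void reframer_push(struct pkt *p,
                   void (*process_batch)(struct pkt **, size_t)) {
    if (n_buf == 0)
        window_start_ns = now_ns();
    buf[n_buf++] = p;

    if (n_buf == MAX_BUF ||
        now_ns() - window_start_ns > (uint64_t)FLUSH_US * 1000) {
        qsort(buf, n_buf, sizeof(buf[0]), by_flow);
        process_batch(buf, n_buf);   /* downstream now sees per-flow bursts */
        n_buf = 0;
    }
}
```

A production version would additionally preserve intra-flow packet order (qsort is not stable) and hook into the NIC's batched receive path rather than a per-packet push.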

Place, publisher, year, edition, pages
USENIX - The Advanced Computing Systems Association, 2022
Keywords
packet ordering, spatial and temporal locality, packet scheduling, batch processing, high-speed networking
National Category
Communication Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-304656 (URN); 000876762200046 (ISI); 2-s2.0-85140983450 (Scopus ID)
Conference
19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), APR 04-06, 2022, Renton, WA
Projects
ULTRA; WASP; Time-Critical Clouds
Funder
Swedish Foundation for Strategic Research; Knut and Alice Wallenberg Foundation; EU, European Research Council
Note

QC 20230619

Available from: 2021-11-09. Created: 2021-11-09. Last updated: 2023-06-19. Bibliographically approved.
Farshin, A., Barbette, T., Roozbeh, A., Maguire Jr., G. Q. & Kostic, D. (2021). PacketMill: Toward Per-Core 100-Gbps Networking. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Paper presented at 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), 19–23 April, 2021, Virtual/Online. ACM Digital Library
PacketMill: Toward Per-Core 100-Gbps Networking
2021 (English). In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM Digital Library, 2021. Conference paper, Published paper (Refereed)
Abstract [en]

We present PacketMill, a system for optimizing software packet processing, which (i) introduces a new model to efficiently manage packet metadata and (ii) employs code-optimization techniques to better utilize commodity hardware. PacketMill grinds the whole packet processing stack, from the high-level network function configuration file to the low-level userspace network drivers (specifically DPDK), to mitigate inefficiencies and produce a customized binary for a given network function. Our evaluation results show that PacketMill increases throughput (by up to 36.4 Gbps, i.e., 70%), reduces latency (by up to 101 µs, i.e., 28%), and enables nontrivial packet processing (e.g., a router) at ~100 Gbps, when new packets arrive more than 10× faster than main memory access times, while using only one processing core.
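
As a rough, hedged illustration of the metadata-management point only (the struct layouts and function below are invented, not the actual X-Change/DPDK interface), the idea is to let the receive path fill a compact, application-specific descriptor directly, instead of carrying a large generic descriptor that must later be converted:

```c
/* Hypothetical structures illustrating metadata management; this is not
 * the actual X-Change or DPDK interface. The generic descriptor carries
 * many fields a given network function never reads; the compact one holds
 * only what this function needs, sized to a single 64-byte cache line and
 * filled directly by the receive path (no later conversion pass). */
#include <stdint.h>

struct generic_md {                 /* generic, one-size-fits-all metadata */
    void    *buf_addr;
    uint16_t data_off, data_len, buf_len, port;
    uint32_t pkt_len, rss_hash;
    uint64_t ol_flags, timestamp;
    void    *pool, *next;
    /* ... further rarely used fields ... */
};

struct app_md {                     /* application-specific metadata        */
    void    *payload;               /* pointer to the packet bytes          */
    uint16_t len;                   /* packet length                        */
    uint16_t out_port;              /* filled in by the forwarding lookup   */
    uint32_t rss_hash;              /* used to pick a flow-table bucket     */
} __attribute__((aligned(64)));

/* Receive path fills the compact form directly from the NIC descriptor. */
static inline void rx_fill_md(struct app_md *m, void *payload,
                              uint16_t len, uint32_t rss_hash) {
    m->payload  = payload;
    m->len      = len;
    m->rss_hash = rss_hash;
    m->out_port = 0;
}
```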

Place, publisher, year, edition, pages
ACM Digital Library, 2021
Keywords
PacketMill, X-Change, Packet Processing, Metadata Management, 100-Gbps Networking, Middleboxes, Commodity Hardware, LLVM, Compiler Optimizations, Full-Stack Optimization, FastClick, DPDK.
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-289665 (URN); 10.1145/3445814.3446724 (DOI); 000829871000001 (ISI); 2-s2.0-85104694209 (Scopus ID)
Conference
26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), 19–23 April, 2021, Virtual/Online
Projects
Time-Critical Clouds; ULTRA; WASP
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; EU, Horizon 2020, 770889
Note

Part of proceedings: ISBN 978-1-4503-8317-2

QC 20210210

Available from: 2021-02-10. Created: 2021-02-10. Last updated: 2024-03-15. Bibliographically approved.
Roozbeh, A. (2021). Realizing Next-Generation Data Centers via Software-Defined “Hardware” Infrastructures and Resource Disaggregation: Exploiting your cache. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
Realizing Next-Generation Data Centers via Software-Defined “Hardware” Infrastructures and Resource Disaggregation: Exploiting your cache
2021 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

The cloud is evolving due to additional demands introduced by new technological advancements and the wide movement toward digitalization. Moreover, next-generation Data Centers (DCs) and clouds are expected (and need) to become cheaper, more efficient, and capable of offering more predictable services. Aligned with this, this thesis examines the concept of Software-Defined “Hardware” Infrastructure (SDHI) based on hardware resource disaggregation as one possible way of realizing next-generation DCs. This thesis starts with an overview of the functional architecture of a cloud based on SDHI. Following this, a series of use-cases and deployment scenarios enabled by SDHI are discussed along with an exploration of the role of each functional block of SDHI’s architecture, i.e., cloud infrastructure, cloud platforms, cloud execution environments, and applications.

This thesis proposes a framework to evaluate the impact of SDHI on the techno-economic efficiency of DCs, explicitly focusing on application profiling, hardware dimensioning, and Total Cost of Ownership (TCO). It then shows that combining resource disaggregation and software-defined capabilities makes DCs less expensive and easier to expand; hence, they can rapidly follow the expected exponential demand growth. Additionally, this thesis elaborates on the technologies underlying SDHI, its challenges, and its potential future directions.

It is argued that achieving and maintaining a high level of memory performance is crucial for realizing SDHI and disaggregated DCs. To this end, a memory management and Input/Output (I/O) data management scheme suitable for SDHI is proposed and its advantages are shown. This work focuses on the management of the Last Level Cache (LLC) in currently available Intel processors, takes advantage of the LLC's Non-Uniform Cache Architecture (NUCA), and investigates how better utilization of the LLC can provide higher performance, more predictable response times, and improved isolation between threads. Additionally, this thesis scrutinizes the impact of cache management, specifically Direct Cache Access (DCA), on the performance of I/O intensive applications. The results of an empirical study show that the proposed memory management scheme enables system designers and developers to optimize systems for I/O intensive applications, and they highlight some potential changes expected for I/O management in future DC systems.
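
The abstract does not give the framework's equations; the snippet below is only a toy illustration, with invented numbers, of the kind of TCO comparison such a framework performs: amortized CAPEX plus yearly OPEX, with disaggregation modeled simply as a higher achievable utilization (i.e., fewer servers for the same demand).

```c
/* Toy TCO comparison with invented numbers; not the thesis's framework.
 * Disaggregation is modeled only as a higher achievable utilization,
 * i.e., fewer servers are needed for the same demand. */
#include <stdio.h>

static double tco(double servers, double capex_per_server,
                  double opex_per_server_year, double years) {
    return servers * (capex_per_server + opex_per_server_year * years);
}

int main(void) {
    const double demand        = 10000.0; /* abstract units of work        */
    const double units_per_srv = 100.0;   /* capacity of one server        */
    const double util_classic  = 0.40;    /* assumed server-bound util.    */
    const double util_disagg   = 0.70;    /* assumed util. with pooling    */
    const double years         = 5.0;

    double srv_classic = demand / (units_per_srv * util_classic);
    double srv_disagg  = demand / (units_per_srv * util_disagg);

    printf("classic      : %5.0f servers, 5-year TCO %9.0f\n",
           srv_classic, tco(srv_classic, 10000.0, 2000.0, years));
    printf("disaggregated: %5.0f servers, 5-year TCO %9.0f\n",
           srv_disagg,  tco(srv_disagg, 11000.0, 2000.0, years));
    return 0;
}
```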

Abstract [sv]

New technological advances and demands, together with the broad move toward digitalization, are making the cloud evolve at an ever faster pace. As a result, next-generation data centers are expected to be cheaper, more efficient, and able to offer more robust services. In line with this, we examine software-defined "hardware" infrastructures (SDHI) based on disaggregation of hardware resources as a potential technology for realizing next-generation data centers. We begin with an overview of the functional architecture of a cloud based on SDHI. We then analyze a number of use cases and deployment scenarios enabled by SDHI, and examine the role that each functional block of an SDHI architecture can play, for example cloud infrastructure, cloud platform, cloud execution environment, and applications.

This thesis then proposes a framework for evaluating the effect SDHI can have on the techno-economic efficiency of a data center, with particular focus on application profiling, hardware dimensioning, and total cost of ownership. Our studies show that data centers can become cheaper and easier to expand by using resource disaggregation and software definition; they can thereby also follow the exponentially growing demand more quickly. We also examine the technologies behind SDHI, its challenges, and possible future directions.

To realize SDHI, it is crucial to achieve and maintain a high level of memory performance. We therefore propose a memory management and I/O data management scheme suitable for SDHI and show its advantages. This work focuses on the management of the last-level cache in currently available Intel processors, exploits the cache's non-uniform architecture, and investigates how better use of the last-level cache can provide higher performance, more predictable response times, and improved isolation between threads. In addition, this thesis examines the effects of cache management, in particular direct cache access, on the performance of I/O-intensive applications. The results of an empirical study show that the proposed memory management scheme enables system designers and developers to optimize systems for I/O-intensive applications, and we highlight some potential changes expected for I/O management in future data centers.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2021. p. 201
Series
TRITA-EECS-AVL ; 2021:77
Keywords
Cloud computing, Next-generation data centers, Resource disaggregation, Total cost of ownership, Last level cache, Direct cache access
National Category
Communication Systems
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-304722 (URN); 978-91-8040-065-7 (ISBN)
Public defence
2021-12-03, Ka-Sal C (Sven-Olof Öhrvik), Kistagången 16, Kista, 16:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20211115

Available from: 2021-11-15. Created: 2021-11-10. Last updated: 2025-03-05. Bibliographically approved.
Roozbeh, A., Farshin, A., Kostic, D. & Maguire Jr., G. Q. (2020). Methods and devices for controlling memory handling. US Patent US12111768B2.
Methods and devices for controlling memory handling
2020 (English). Patent (Other (popular science, discussion, etc.))
Abstract [en]

A method and device for controlling memory handling in a processing system comprising a cache shared between a plurality of processing units, wherein the cache comprises a plurality of cache portions. The method comprises obtaining first information pertaining to an allocation of a first memory portion of a memory to a first application, an allocation of a first processing unit of the plurality of processing units to the first application, and an association between a first cache portion of the plurality of cache portions and the first processing unit. The method further comprises reconfiguring a mapping configuration based on the obtained first information, and controlling a providing of first data associated with the first application to the first cache portion from the first memory portion using the reconfigured mapping configuration.
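
Purely as an illustration of the bookkeeping the abstract describes (the structures and function names below are hypothetical and not taken from the patent), the mapping configuration can be modeled as a table tying each application to its memory portion, its processing unit, and the cache portion associated with that unit:

```c
/* Hypothetical bookkeeping structures (not taken from the patent text):
 * a mapping configuration ties an application's memory portion to the
 * cache portion associated with its processing unit, so later transfers
 * of that application's data can be steered accordingly. */
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 64

struct mapping_entry {
    int       app_id;        /* the (first) application                  */
    uintptr_t mem_base;      /* its memory portion: base address         */
    size_t    mem_len;       /* ... and length                           */
    int       cpu_id;        /* processing unit allocated to the app     */
    int       cache_portion; /* cache portion associated with that CPU   */
};

struct mapping_config {
    struct mapping_entry entries[MAX_ENTRIES];
    int n;
};

/* Reconfigure the mapping based on newly obtained allocation information. */
static int reconfigure(struct mapping_config *cfg, int app_id,
                       uintptr_t mem_base, size_t mem_len,
                       int cpu_id, int cache_portion) {
    if (cfg->n >= MAX_ENTRIES)
        return -1;                       /* table full */
    cfg->entries[cfg->n++] = (struct mapping_entry){
        app_id, mem_base, mem_len, cpu_id, cache_portion
    };
    return 0;
}

/* Data transfers for app_id consult the table so that data from its memory
 * portion is provided to its CPU's cache portion. The actual steering
 * mechanism is hardware/driver specific and is not shown here. */
static const struct mapping_entry *lookup(const struct mapping_config *cfg,
                                          int app_id) {
    for (int i = 0; i < cfg->n; i++)
        if (cfg->entries[i].app_id == app_id)
            return &cfg->entries[i];
    return NULL;
}
```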

Keywords
memory handling, shared cache
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-358308 (URN)
Patent
US12111768B2 (2024-10-08)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20250120

Available from: 2025-01-10. Created: 2025-01-10. Last updated: 2025-01-20. Bibliographically approved.
Farshin, A., Roozbeh, A., Maguire Jr., G. Q. & Kostic, D. (2020). Optimizing Intel Data Direct I/O Technology for Multi-hundred-gigabit Networks. In: Proceedings of the Fifteenth EuroSys Conference (EuroSys'20), Heraklion, Crete, Greece, April 27-30, 2020. Paper presented at the Fifteenth EuroSys Conference (EuroSys'20), Heraklion, Crete, Greece, April 27-30, 2020.
Optimizing Intel Data Direct I/O Technology for Multi-hundred-gigabit Networks
2020 (English). In: Proceedings of the Fifteenth EuroSys Conference (EuroSys'20), Heraklion, Crete, Greece, April 27-30, 2020. Conference paper, Poster (with or without abstract) (Refereed) [Artistic work]
Abstract [en]

Digitalization across society is expected to produce a massive amount of data, leading to the introduction of faster network interconnects. In addition, many Internet services require high throughput and low latency. However, having faster links alone does not guarantee high throughput or low latency. Therefore, it is essential to perform holistic system optimization to take full advantage of the faster links and provide high-performance services. Intel Data Direct I/O (DDIO) is a recent technology introduced to facilitate the deployment of high-performance services based on fast interconnects. We evaluated the effectiveness of DDIO for multi-hundred-gigabit networks. This paper briefly discusses our findings on DDIO, which show the necessity of optimizing/adapting it to address the challenges of multi-hundred-gigabit-per-second links.

Keywords
Data Direct I/O technology, DDIO, Optimizing, Characteristic, Multi-hundred-gigabit networks.
National Category
Communication Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-272720 (URN)
Conference
Fifteenth EuroSys Conference (EuroSys'20), Heraklion, Crete, Greece, April 27-30, 2020.
Projects
Time-Critical Clouds; ULTRA; WASP
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; EU, Horizon 2020, 770889
Note

QC 20200626

Available from: 2020-04-27. Created: 2020-04-27. Last updated: 2022-06-26. Bibliographically approved.
Farshin, A., Roozbeh, A., Maguire Jr., G. Q. & Kostic, D. (2020). Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks. In: 2020 USENIX Annual Technical Conference (USENIX ATC 20). Paper presented at USENIX ATC'20 (pp. 673-689).
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
2020 (English). In: 2020 USENIX Annual Technical Conference (USENIX ATC 20), 2020, p. 673-689. Conference paper, Published paper (Refereed)
Abstract [en]

Memory access is the major bottleneck in realizing multi-hundred-gigabit networks with commodity hardware; hence, it is essential to make good use of cache memory, a faster but smaller memory closer to the processor. Our goal is to study the impact of cache management on the performance of I/O intensive applications. Specifically, this paper looks at one of the bottlenecks in packet processing, i.e., direct cache access (DCA). We systematically studied the current implementation of DCA in Intel processors, particularly Data Direct I/O technology (DDIO), which directly transfers data between I/O devices and the processor's cache. Our empirical study enables system designers/developers to optimize DDIO-enabled systems for I/O intensive applications. We demonstrate that optimizing DDIO could reduce the latency of I/O intensive network functions running at 100 Gbps by up to ~30%. Moreover, we show that DDIO causes a 30% increase in tail latencies when processing packets at 200 Gbps; hence, it is crucial to selectively inject data into the cache or to explicitly bypass it.
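
One concrete knob in this context is the number of LLC ways that DDIO may use. Below is a minimal sketch of reading and changing that value through the Linux msr driver; the register address 0xC8B ("IIO LLC WAYS") is the one commonly reported for Skylake-SP Xeons and is an assumption here, the value is a per-way bit mask, and writing it requires root, a loaded msr module, and verification of the semantics for your specific CPU.

```c
/* Minimal sketch: inspect (and optionally rewrite) the mask of LLC ways
 * available to DDIO via the Linux msr driver. ASSUMPTION: MSR 0xC8B
 * ("IIO LLC WAYS") as commonly reported for Skylake-SP Xeons; each set
 * bit enables one way. Requires root and a loaded msr module; verify the
 * address and semantics for your CPU before writing anything. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IIO_LLC_WAYS_MSR 0xC8B   /* assumed Skylake-SP register address */

int main(int argc, char **argv) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), IIO_LLC_WAYS_MSR) != sizeof(val)) {
        perror("pread"); return 1;
    }
    printf("current IIO LLC WAYS mask: 0x%llx\n", (unsigned long long)val);

    if (argc > 1) {                        /* e.g. ./ddio_ways 0x7F0 */
        uint64_t mask = strtoull(argv[1], NULL, 0);
        if (pwrite(fd, &mask, sizeof(mask), IIO_LLC_WAYS_MSR) != sizeof(mask)) {
            perror("pwrite"); return 1;
        }
        printf("wrote mask: 0x%llx\n", (unsigned long long)mask);
    }
    close(fd);
    return 0;
}
```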

Keywords
Direct Cache Access (DCA), Data Direct I/O Technology (DDIO), Cache Injection, Tuning, IIO LLC WAYS Register, Bypassing Cache, Characteristic, Multi-hundred-gigabit networks.
National Category
Communication Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-278513 (URN); 000696712200046 (ISI); 2-s2.0-85091923908 (Scopus ID)
Conference
USENIX ATC'20
Projects
Time-Critical Clouds; ULTRA; WASP
Funder
Swedish Foundation for Strategic Research; Wallenberg AI, Autonomous Systems and Software Program (WASP); EU, Horizon 2020, 770889
Note

QC 20200714

Available from: 2020-07-11. Created: 2020-07-11. Last updated: 2024-03-15. Bibliographically approved.
Roozbeh, A., Kostic, D., Maguire Jr., G. Q. & Farshin, A. (2019). Entities, system and methods performed therein for handling memory operations of an application in a computer environment. US Patent US12111766B2.
Entities, system and methods performed therein for handling memory operations of an application in a computer environment
2019 (English). Patent (Other (popular science, discussion, etc.))
Abstract [en]

Embodiments herein relate, e.g., to a method performed by a first entity for handling memory operations of an application in a computer environment. The first entity obtains position data associated with data of the application being fragmented into a number of positions in a physical memory. The position data indicates one or more positions of the number of positions in the physical memory. The first entity then provides, to a second entity, one or more indications of the one or more positions indicated by the position data, so that data can be prefetched from the second entity using the one or more indications.
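
As a toy illustration only (the interfaces below are hypothetical and not the patent's entities), one component can record the positions at which an application's data is fragmented, and another can walk that list and issue prefetch hints before the data is used:

```c
/* Hypothetical interfaces, not the patent's entities: one component records
 * the positions at which an application's data is fragmented; another walks
 * the list and issues prefetch hints so the fragments are warm in cache
 * before the application touches them. */
#include <stddef.h>

#define MAX_POSITIONS 256

struct position_data {
    const void *positions[MAX_POSITIONS];  /* addresses of the fragments */
    size_t      count;
};

/* "First entity": record where a fragment of the application's data lives. */
static void record_position(struct position_data *pd, const void *addr) {
    if (pd->count < MAX_POSITIONS)
        pd->positions[pd->count++] = addr;
}

/* "Second entity": prefetch each recorded fragment ahead of use.
 * __builtin_prefetch is a GCC/Clang hint; 0 = read access, 3 = keep the
 * data in all levels of the cache hierarchy. */
static void prefetch_fragments(const struct position_data *pd) {
    for (size_t i = 0; i < pd->count; i++)
        __builtin_prefetch(pd->positions[i], 0, 3);
}
```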

Keywords
memory operations, prefetching
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-358307 (URN)
Patent
US12111766B2 (2024-10-08)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20250120

Available from: 2025-01-10. Created: 2025-01-10. Last updated: 2025-01-20. Bibliographically approved.
Farshin, A., Roozbeh, A., Maguire Jr., G. Q. & Kostic, D. (2019). Make the Most out of Last Level Cache in Intel Processors. In: Proceedings of the Fourteenth EuroSys Conference (EuroSys'19), Dresden, Germany, 25-28 March 2019. Paper presented at EuroSys'19. ACM Digital Library
Make the Most out of Last Level Cache in Intel Processors
2019 (English). In: Proceedings of the Fourteenth EuroSys Conference (EuroSys'19), Dresden, Germany, 25-28 March 2019, ACM Digital Library, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

In modern (Intel) processors, the Last Level Cache (LLC) is divided into multiple slices, and an undocumented hashing algorithm (aka Complex Addressing) maps different parts of the memory address space among these slices to increase the effective memory bandwidth. After a careful study of Intel's Complex Addressing, we introduce a slice-aware memory management scheme, wherein frequently used data can be accessed faster via the LLC. Using our proposed scheme, we show that a key-value store can potentially improve its average performance by ∼12.2% and ∼11.4% for 100% and 95% GET workloads, respectively. Furthermore, we propose CacheDirector, a network I/O solution which extends Data Direct I/O (DDIO) and places the packet's header in the slice of the LLC that is closest to the relevant processing core. We implemented CacheDirector as an extension to DPDK and evaluated our proposed solution for latency-critical applications in Network Function Virtualization (NFV) systems. Evaluation results show that CacheDirector makes packet processing faster by reducing tail latencies (90-99th percentiles) by up to 119 µs (∼21.5%) for optimized NFV service chains running at 100 Gbps. Finally, we analyze the effectiveness of slice-aware memory management in realizing cache isolation.
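
A conceptual sketch of the slice-aware placement step follows; it is not the paper's CacheDirector code, and slice_of_addr() is left as a stub because the Complex Addressing hash is undocumented and must be learned or measured per CPU model. Given some way to tell which LLC slice a cache line maps to, the allocator picks a buffer offset whose line maps to the slice closest to the processing core and places the packet header there:

```c
/* Conceptual sketch, not the paper's CacheDirector code. slice_of_addr()
 * is a stub: the slice mapping comes from Intel's undocumented Complex
 * Addressing hash and must be learned/measured per CPU model. Given such
 * a mapping, place_header() scans a chunk for the first cache-line-aligned
 * offset that maps to the preferred (closest) slice. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Stub: return the LLC slice index of the cache line containing addr. */
static unsigned slice_of_addr(uintptr_t addr) { (void)addr; return 0; }

/* Pick an offset inside [chunk, chunk+chunk_len) whose cache line maps to
 * preferred_slice; the packet header is then copied to / built at that
 * offset so the hot header bytes are served by the closest slice. */
static void *place_header(void *chunk, size_t chunk_len,
                          unsigned preferred_slice) {
    uintptr_t end  = (uintptr_t)chunk + chunk_len;
    uintptr_t line = ((uintptr_t)chunk + CACHE_LINE - 1) &
                     ~(uintptr_t)(CACHE_LINE - 1);
    for (; line + CACHE_LINE <= end; line += CACHE_LINE)
        if (slice_of_addr(line) == preferred_slice)
            return (void *)line;
    return chunk;                       /* fall back to the chunk start */
}
```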

Place, publisher, year, edition, pages
ACM Digital Library, 2019
Keywords
Slice-aware Memory Management, Last Level Cache, Non-Uniform Cache Architecture, CacheDirector, DDIO, DPDK, Network Function Virtualization, Cache Partitioning, Cache Allocation Technology, Key-Value Store.
National Category
Communication Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-244750 (URN); 10.1145/3302424.3303977 (DOI); 000470898700008 (ISI); 2-s2.0-85063919722 (Scopus ID)
Conference
EuroSys'19
Projects
Time-Critical Clouds; ULTRA; WASP
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research; EU, Horizon 2020, 770889
Note

QC 20190226

Part of ISBN 9781450362818

Available from: 2019-02-24. Created: 2019-02-24. Last updated: 2024-10-24. Bibliographically approved.
Roozbeh, A., Farshin, A., Kostic, D. & Maguire Jr., G. Q. (2018). Methods and nodes for handling memory. US Patent US11714753B2.
Methods and nodes for handling memory
2018 (English). Patent (Other (popular science, discussion, etc.))
Abstract [en]

A method in a multi-core processing system that comprises: a processor comprising at least a first and a second processing unit; a cache, common to the first and the second processing unit, comprising a first cache portion associated with the first processing unit and a second cache portion associated with the second processing unit; and a memory comprising a first memory portion associated with the first cache portion and a second memory portion associated with the second cache portion. The method comprises detecting that a data access criterion of the second memory portion is fulfilled, establishing that first data stored in the second memory portion is related to a first application running on the first processing unit, allocating at least a part of the first memory portion to the first application based on cache information, and migrating the first data to that part of the first memory portion.
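
The following is only an illustrative model of the described flow (hypothetical structures and a toy allocator, not the patent's method): when tracked data becomes hot and resides in a memory portion tied to a different cache portion than the one serving the application's processing unit, space is allocated in the right portion and the data is migrated there:

```c
/* Hypothetical structures and a toy allocator, not the patent's method:
 * if tracked data is "hot" (access criterion met) but resides in a memory
 * portion tied to a different cache portion than the one serving the
 * application's processing unit, migrate it into the right portion. */
#include <stddef.h>
#include <string.h>

#define ACCESS_THRESHOLD 1000

struct mem_portion {                /* memory portion tied to one cache portion */
    char  *base;
    size_t len, used;
    int    cache_portion;
};

struct tracked_data {
    void         *addr;             /* current location of the data            */
    size_t        len;
    int           resident_portion; /* cache portion tied to current location  */
    unsigned long accesses;         /* counter driving the access criterion    */
};

/* Toy bump allocator inside the destination memory portion. */
static void *portion_alloc(struct mem_portion *p, size_t len) {
    if (p->used + len > p->len)
        return NULL;
    void *a = p->base + p->used;
    p->used += len;
    return a;
}

/* Migrate d into `home` (the portion associated with the cache portion of
 * the CPU running the application) if the access criterion is fulfilled. */
static int maybe_migrate(struct tracked_data *d, struct mem_portion *home) {
    if (d->accesses < ACCESS_THRESHOLD)
        return 0;                                /* not hot enough             */
    if (d->resident_portion == home->cache_portion)
        return 0;                                /* already in the right place */
    void *dst = portion_alloc(home, d->len);
    if (!dst)
        return -1;                               /* destination portion full   */
    memcpy(dst, d->addr, d->len);                /* move the data              */
    d->addr = dst;                               /* application now uses copy  */
    d->resident_portion = home->cache_portion;
    return 1;
}
```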

Keywords
memory handling, cache, multi-core processing system
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-358309 (URN)
Patent
US11714753B2 (2023-08-01)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20250120

Available from: 2025-01-10. Created: 2025-01-10. Last updated: 2025-01-20. Bibliographically approved.
Roozbeh, A., Soares, J., Maguire Jr., G. Q., Wuhib, F., Padala, C., Mahloo, M., . . . Kostic, D. (2018). Software-Defined "Hardware" Infrastructures: A Survey on Enabling Technologies and Open Research Directions. IEEE Communications Surveys and Tutorials, 20(3), 2454-2485
Software-Defined "Hardware" Infrastructures: A Survey on Enabling Technologies and Open Research Directions
2018 (English). In: IEEE Communications Surveys and Tutorials, E-ISSN 1553-877X, Vol. 20, no 3, p. 2454-2485. Article in journal (Refereed). Published
Abstract [en]

This paper provides an overview of software-defined "hardware" infrastructures (SDHI). SDHI builds upon the concept of hardware (HW) resource disaggregation. HW resource disaggregation breaks today's physical server-oriented model, in which the use of a physical resource (e.g., processor or memory) is constrained to a physical server's chassis. SDHI extends the definition of software-defined infrastructures (SDI) and brings greater modularity, flexibility, and extensibility to cloud infrastructures, thus allowing cloud operators to employ resources more efficiently and allowing applications not to be bound by the physical infrastructure's layout. This paper aims to be an initial introduction to SDHI and its associated technological advancements. It starts with an overview of the cloud domain and puts into perspective some of the most prominent efforts in the area. Then, it presents a set of differentiating use-cases that SDHI enables. Next, it states the fundamentals behind SDI and SDHI and elaborates on why SDHI is of great interest today. Moreover, it provides an overview of the functional architecture of a cloud built on SDHI, exploring how this transformation reaches far beyond the cloud infrastructure level, affecting platforms, execution environments, and applications. Finally, an in-depth assessment is made of the technologies behind SDHI, the impact of these technologies, and the associated challenges and potential future directions of SDHI.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
software-defined infrastructure, resource disaggregation, cloud infrastructure, rack-scale, hyperscale computing, disaggregated DC
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-235270 (URN); 10.1109/COMST.2018.2834731 (DOI); 000443030500033 (ISI); 2-s2.0-85046804138 (Scopus ID)
Funder
Swedish Foundation for Strategic Research; Knut and Alice Wallenberg Foundation
Note

QC 20180919

Available from: 2018-09-19. Created: 2018-09-19. Last updated: 2024-02-27. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-4088-7884
