Publications (5 of 5)
Bogdanov, K. (2018). Enabling Fast and Accurate Run-Time Decisions in Geo-Distributed Systems: Better Achieving Service Level Objectives. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
2018 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

Computing services are highly integrated into modern society and used by millions of people daily. To meet these high demands, many popular services are implemented and deployed as geo-distributed applications on top of third-party virtualized cloud providers. However, the nature of such a deployment leads to variable performance. To deliver high quality of service, these systems strive to adapt to ever-changing conditions by monitoring changes in state and making informed run-time decisions, such as choosing server peering, replica placement, and redirection of requests. In this dissertation, we seek to improve the quality of run-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic implemented in these systems, and (3) a clear view of the network and system states, allowing services to make better-informed decisions.

First, we validate an application’s decision logic used in popular storage systems by examining replica selection algorithms. We do this by introducing GeoPerf, a tool that uses symbolic execution and modeling to perform systematic testing of replica selection algorithms. GeoPerf was used to test two popular storage systems and found one bug in each.

Then, using measurements across EC2, we observed persistent correlation between network paths and network latency. Based on these observations, we introduce EdgeVar, a tool that decouples routing-based and congestion-based changes in network latency. This additional information improves estimation of latency and increases the stability of network path selection.

Next, we introduce Tectonic, a tool that tracks an application’s requests and responses at both the user and kernel levels. In combination with EdgeVar, it decouples end-to-end request completion time into three components: network routing, network congestion, and service time.

Finally, we demonstrate how this decoupling of request completion time components can be leveraged in practice by developing Kurma, a fast and accurate load balancer for geo-distributed storage systems. At run-time, Kurma integrates network latency and service time distributions to accurately estimate the rate of Service Level Objective (SLO) violations for requests redirected between geo-distributed datacenters. Using real-world data, we demonstrate Kurma’s ability to effectively share load among datacenters while reducing SLO violations by a factor of up to 3 in high load settings or reducing the cost of running the service by up to 17%. The techniques described in this dissertation are important for current and future geo-distributed services that strive to provide the best quality of service to customers while minimizing the cost of operating the service.

Abstract [sv] (translated to English)

Computing services are a well-integrated part of modern society and are used by millions of people daily. To meet their high demands, many popular services are implemented and deployed as geo-distributed applications on top of third-party virtualized cloud services. However, it is in the nature of such deployments that they result in variable performance. To deliver high quality of service, such systems need to continually adapt to changing conditions by monitoring changes in state and making informed real-time decisions, such as the choice of server to peer with, replica placement, and redirection of requests. This dissertation aims to improve the quality of real-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic already implemented in these systems, and (3) a clear view of the network and system states that allows these services to make better-informed decisions.

We begin by validating an application’s decision logic that is common in popular storage systems by examining the replica selection algorithm. We do this by introducing GeoPerf, a tool that applies symbolic execution and modeling for systematic testing of such selection algorithms. GeoPerf was used to test two popular storage systems and found a bug in both.

Through measurements across EC2, we then observe a persistent correlation between network paths and network latency. Based on these observations, we introduce EdgeVar, a tool that separates routing-based and congestion-based changes in network latency. This additional information improves the quality of the latency estimate and improves the stability of the network's path selection.

Next, we introduce Tectonic, a tool that follows an application’s requests and responses at both the user and kernel levels. Together with EdgeVar, the end-to-end time from request to completion can be divided into three parts: network routing, congestion, and service time.

Finally, we demonstrate how this decomposition of the end-to-end time can be exploited in practice by developing Kurma, a fast and accurate load balancer for geo-distributed storage systems. At run time, Kurma integrates network delay and service time distributions to accurately estimate the rate of Service Level Objective (SLO) violations for requests redirected between geo-distributed datacenters. Using real-time data, we demonstrate Kurma’s ability to effectively distribute load among datacenters while reducing SLO violations by up to a factor of three under high load, or reducing the cost of running the service by up to 17%. The techniques described in this dissertation can be considered important for current and future geo-distributed services that strive to provide the best quality of service to users while minimizing the cost of operating the services.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2018. p. 184
Series
TRITA-EECS-AVL ; 2018:78
Keywords
Cloud Computing, Geo-Distributed Systems, Replica Selection Algorithms, Molntjänster, Geodistribuerade system, Valalgoritmer för replikas
National Category
Communication Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-238665 (URN); 978-91-7729-985-1 (ISBN)
Public defence
2018-11-26, Sal-B, Kistagången 16, Electrum 1, våningsplan 2, Kista, 10:00 (English)
Note

QC 20181101

Available from: 2018-11-07. Created: 2018-11-07. Last updated: 2018-11-19. Bibliographically approved
Bogdanov, K., Reda, W., Maguire Jr., G. Q., Kostic, D. & Canini, M. (2018). Fast and accurate load balancing for geo-distributed storage systems. In: SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing. Paper presented at 2018 ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, United States, 11 October 2018 through 13 October 2018 (pp. 386-400). Association for Computing Machinery (ACM)
2018 (English). In: SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing, Association for Computing Machinery (ACM), 2018, p. 386-400. Conference paper, Published paper (Refereed)
Abstract [en]

The increasing density of globally distributed datacenters reduces the network latency between neighboring datacenters and allows replicated services deployed across neighboring locations to share workload when necessary, without violating strict Service Level Objectives (SLOs). We present Kurma, a practical implementation of a fast and accurate load balancer for geo-distributed storage systems. At run-time, Kurma integrates network latency and service time distributions to accurately estimate the rate of SLO violations for requests redirected across geo-distributed datacenters. Using these estimates, Kurma solves a decentralized rate-based performance model enabling fast load balancing (in the order of seconds) while taming global SLO violations. We integrate Kurma with Cassandra, a popular storage system. Using real-world traces along with a geo-distributed deployment across Amazon EC2, we demonstrate Kurma’s ability to effectively share load among datacenters while reducing SLO violations by up to a factor of 3 in high load settings or reducing the cost of running the service by up to 17%.
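
As a rough illustration of the estimate described in the abstract above, the following sketch combines an empirical network latency sample with an empirical service time sample and reports the fraction of combinations that exceed an SLO. This is not Kurma's implementation: the function name `slo_violation_rate`, the assumption that the two samples are independent, and all sample values are hypothetical.

```python
from itertools import product


def slo_violation_rate(network_ms, service_ms, slo_ms):
    """Estimate P(network latency + service time > SLO).

    Assumption (not from the paper): the two samples are independent, so every
    pairing of a latency sample with a service-time sample is equally likely.
    """
    if not network_ms or not service_ms:
        raise ValueError("both samples must be non-empty")
    pairs = len(network_ms) * len(service_ms)
    violations = sum(
        1 for n, s in product(network_ms, service_ms) if n + s > slo_ms
    )
    return violations / pairs


# Made-up samples in milliseconds, for illustration only.
rtt_to_remote_dc = [20.0, 22.0, 25.0, 40.0]
service_time = [5.0, 8.0, 30.0]
rate = slo_violation_rate(rtt_to_remote_dc, service_time, slo_ms=50.0)
print(f"estimated SLO violation rate: {rate:.3f}")
```

Pairing every sample with every other is quadratic in the sample sizes; a histogram-based convolution would likely scale better, but the exhaustive form keeps the idea visible.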

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
Keywords
Cloud Computing, Distributed Systems, Server Load Balancing, Service Level Objectives, Wide Area Networks
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-241481 (URN); 10.1145/3267809.3267820 (DOI); 2-s2.0-85059006718 (Scopus ID); 9781450360111 (ISBN)
Conference
2018 ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, United States, 11 October 2018 through 13 October 2018
Funder
EU, Horizon 2020, 770889; Swedish Foundation for Strategic Research
Note

QC 20190123

Available from: 2019-01-23. Created: 2019-01-23. Last updated: 2019-04-29. Bibliographically approved
Bogdanov, K., Reda, W., Kostic, D., Maguire Jr., G. Q. & Canini, M. (2018). Kurma: Fast and Efficient Load Balancing for Geo-Distributed Storage Systems: Evaluation of Convergence and Scalability.
2018 (English). Report (Other academic)
Abstract [en]

This report provides an extended evaluation of Kurma, a practical implementation of a geo-distributed load balancer for backend storage systems. In this report we demonstrate the ability of distributed Kurma instances to accurately converge to the same solutions within 1% of the total datacenter’s capacity and the ability of Kurma to scale up to 8 datacenters using a single CPU core at each datacenter.

National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-222289 (URN)
Note

QR 20180212

Available from: 2018-02-05. Created: 2018-02-05. Last updated: 2018-02-12. Bibliographically approved
Bogdanov, K., Peón-Quirós, M., Maguire Jr., G. Q. & Kostic, D. (2015). The Nearest Replica Can Be Farther Than You Think. In: Proceedings of the ACM Symposium on Cloud Computing 2015. Paper presented at ACM Symposium on Cloud Computing, August 27-29, 2015, Hawaii (pp. 16-29). Association for Computing Machinery (ACM)
2015 (English). In: Proceedings of the ACM Symposium on Cloud Computing 2015, Association for Computing Machinery (ACM), 2015, p. 16-29. Conference paper, Published paper (Refereed)
Abstract [en]

Modern distributed systems are geo-distributed for reasons of increased performance, reliability, and survivability. At the heart of many such systems, e.g., the widely used Cassandra and MongoDB data stores, is an algorithm for choosing a closest set of replicas to service a client request. Suboptimal replica choices due to dynamically changing network conditions increase response latency and thereby reduce performance. We present GeoPerf, a tool that tries to automate the process of systematically testing the performance of replica selection algorithms for geo-distributed storage systems. Our key idea is to combine symbolic execution and lightweight modeling to generate a set of inputs that can expose weaknesses in replica selection. As part of our evaluation, we analyzed network round trip times between geographically distributed Amazon EC2 regions, and showed a significant number of daily changes in nearest-K replica orders. We tested Cassandra and MongoDB using our tool, and found bugs in each of these systems. Finally, we use our collected Amazon EC2 latency traces to quantify the time lost due to these bugs. For example, due to the bug in Cassandra, the median wasted time for 10% of all requests is above 50 ms.
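
To make "changes in nearest-K replica orders" concrete, the sketch below counts how often the ordered set of the K lowest-RTT replicas differs between consecutive measurements. It is only an illustration and is unrelated to GeoPerf's symbolic execution: the helper names `nearest_k` and `count_order_changes`, the region names, and the RTT values are all hypothetical.

```python
def nearest_k(rtts, k):
    """Return the k replica names with the lowest RTT, closest first.

    Ties are broken by name so the ordering is deterministic.
    """
    return tuple(sorted(rtts, key=lambda name: (rtts[name], name))[:k])


def count_order_changes(trace, k):
    """Count consecutive measurements whose nearest-k ordering differs."""
    orders = [nearest_k(sample, k) for sample in trace]
    return sum(1 for prev, cur in zip(orders, orders[1:]) if prev != cur)


# Made-up RTTs (ms) from one client to three hypothetical replica regions.
trace = [
    {"region-a": 80.0, "region-b": 95.0, "region-c": 150.0},
    {"region-a": 82.0, "region-b": 94.0, "region-c": 151.0},
    {"region-a": 99.0, "region-b": 93.0, "region-c": 149.0},
    {"region-a": 81.0, "region-b": 96.0, "region-c": 150.0},
]
changes = count_order_changes(trace, k=2)
print(f"nearest-2 order changes: {changes}")
```

Here the nearest-2 order flips when `region-a` briefly becomes slower than `region-b` and flips back afterwards, so two changes are counted.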

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
Keywords
Geo-Distributed Systems, Replica Selection Algorithms, Symbolic Execution
National Category
Communication Systems; Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-171434 (URN); 10.1145/2806777.2806939 (DOI); 000380606400002 (); 2-s2.0-84958960133 (Scopus ID)
Conference
ACM Symposium on Cloud Computing, August 27-29, 2015, Hawaii
Funder
EU, European Research Council, 259110
Note

To obtain the data used in this work please contact dmk@kth.se and kirillb@kth.se.

QC 20150812

Available from: 2015-08-03. Created: 2015-08-03. Last updated: 2018-10-07. Bibliographically approved
Bogdanov, K., Peón-Quirós, M., Maguire Jr., G. Q. & Kostić, D. (2015). Toward Automated Testing of Geo-Distributed Replica Selection Algorithms. In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. Paper presented at 2015 ACM Conference on Special Interest Group on Data Communication (pp. 89-90). Association for Computing Machinery (ACM)
2015 (English). In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, Association for Computing Machinery (ACM), 2015, p. 89-90. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Many geo-distributed systems rely on replica selection algorithms to communicate with the closest set of replicas. Unfortunately, the bursty nature of Internet traffic and ever-changing network conditions make it difficult to identify the best choice of replicas. Suboptimal replica choices result in increased response latency and reduced system performance. In this work we present GeoPerf, a tool that tries to automate testing of geo-distributed replica selection algorithms. We used GeoPerf to test Cassandra and MongoDB, two popular data stores, and found bugs in each of these systems.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
Series
SIGCOMM ’15
Keywords
replica selection algorithms, software testing and debugging, symbolic execution, wide area networks
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-173882 (URN); 10.1145/2785956.2790013 (DOI); 000370556200009 (); 2-s2.0-84962237699 (Scopus ID)
Conference
2015 ACM Conference on Special Interest Group on Data Communication
Funder
EU, European Research Council, ERC project 259110.
Note

QC 20150923. QC 20160407

Available from: 2015-09-22. Created: 2015-09-22. Last updated: 2016-04-07. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-7642-6591