Application-level Chaos Engineering
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS. ORCID iD: 0000-0002-7211-3894
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

With the development of software techniques, software systems nowadays are becoming highly complex. In order to keep such systems as reliable as possible, developers need to design various error-handling mechanisms. Considering that the error-handling code needs to work properly in production, it should not only be tested offline but also verified in production after deploying the system. Chaos engineering is a technique that assesses a software system's error-handling mechanisms in production directly. In order to apply chaos engineering, developers first monitor the target system and identify its steady state. Then specific failures are injected in a controlled manner so that the system's error-handling code is triggered and analyzed. By comparing the observed behavior during a chaos engineering experiment with the steady state, developers confirm whether the designed error-handling mechanisms work as expected.
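The procedure described above (identify the steady state, inject a failure in a controlled manner, compare) can be sketched in a few lines. This is only an illustrative sketch, not the tooling from the thesis: the names `run_experiment`, `within_steady_state`, and `ToyService`, the single latency metric, and the relative tolerance of 20% are all assumptions made for the example.

```python
import statistics
from typing import Callable, List


def measure(probe: Callable[[], float], samples: int) -> List[float]:
    """Collect `samples` readings of one metric (here: latency in ms)."""
    return [probe() for _ in range(samples)]


def within_steady_state(baseline: List[float], observed: List[float],
                        tolerance: float) -> bool:
    """True if the observed mean stays within a relative tolerance of the baseline mean."""
    base_mean = statistics.mean(baseline)
    return abs(statistics.mean(observed) - base_mean) <= tolerance * abs(base_mean)


def run_experiment(probe: Callable[[], float],
                   inject: Callable[[], None],
                   recover: Callable[[], None],
                   samples: int = 5,
                   tolerance: float = 0.2) -> bool:
    """Return True if the system stays in its steady state while the fault is active."""
    baseline = measure(probe, samples)      # 1. identify the steady state
    inject()                                # 2. inject the failure
    try:
        observed = measure(probe, samples)  # 3. observe behavior under perturbation
    finally:
        recover()                           # always remove the fault again
    return within_steady_state(baseline, observed, tolerance)


class ToyService:
    """Hypothetical stand-in for a monitored system (assumption for the example)."""

    def __init__(self) -> None:
        self.fault_active = False

    def latency_ms(self) -> float:
        return 150.0 if self.fault_active else 100.0
```

With the toy service, a fault that raises latency from 100 ms to 150 ms violates a 20% tolerance, so `run_experiment` returns `False`; the `finally` clause ensures the fault is removed even if a measurement fails.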

In the field of chaos engineering, there still exist technical challenges that affect the effectiveness of the approach. This thesis makes contributions to the following three open challenges in chaos engineering.

First of all, as chaos engineering experiments are done in production, it is important to improve the efficiency of these experiments. In order to reduce unrealistic experiments, we propose a new approach that synthesizes chaos engineering fault models using the naturally happening errors in production.

Second, in order to analyze a system's steady state and detect its abnormal behavior during chaos engineering experiments, sufficient observability is the key. We propose a multi-layer observability improvement solution for Dockerized Java applications. With the help of our solution, developers are able to improve an application's observability at the operating system level, the runtime environment level, and the application level, with limited effort.

Last, chaos engineering should help locate the actual places where resilience can be improved. We propose three fault injection approaches that apply chaos engineering at the application level in order to take domain-specific knowledge into consideration.

Abstract [sv] (translated from Swedish)

Modern software systems are becoming increasingly complex as society's need for digitalization grows. To maintain reliability in the face of this growing complexity, developers introduce various error-handling mechanisms, which are evaluated through rigorous testing and verification in both test and production environments to ensure a functioning system. To evaluate a system in production, chaos engineering can be applied by first monitoring the system to identify its steady state. Specific, controllable failures can then be injected to activate the error-handling mechanisms, which produce output. This output is captured and compared with the system's behavior in its steady state in order to verify whether the mechanisms work as expected.

In the field of chaos engineering, there are still technical challenges that affect its effectiveness. This thesis contributes to the following three open challenges in chaos engineering.

First, since chaos engineering experiments are carried out in production, it is important to improve the efficiency of these experiments. To reduce unrealistic experiments, we propose a new approach that synthesizes chaos engineering fault models using the errors that occur naturally in production.

Second, to analyze a system's steady state and detect its abnormal behavior during chaos engineering experiments, sufficient observability is key. We propose a multi-layer observability improvement solution for Dockerized Java applications. With our solution, developers can improve an application's observability at the operating system level, the runtime environment level, and the application level, with limited effort.

Finally, chaos engineering should help locate the actual places where resilience can be improved. We propose three fault injection approaches that apply chaos engineering at the application level in order to take domain-specific knowledge into account.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2022, p. vi, 71
Series
TRITA-EECS-AVL ; 2022:57
Keywords [en]
fault injection, dynamic analysis, software resilience, chaos engineering
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-320638; ISBN: 978-91-8040-347-4 (print); OAI: oai:DiVA.org:kth-320638; DiVA, id: diva2:1707948
Public defence
2022-11-29, Zoom: https://kth-se.zoom.us/j/61717169026, F3, Lindstedtsvägen 26 & 28, Stockholm, 09:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20221102

Available from: 2022-11-02 Created: 2022-11-02 Last updated: 2022-11-09. Bibliographically approved
List of papers
1. Maximizing Error Injection Realism for Chaos Engineering with System Calls
2021 (English). In: IEEE Transactions on Dependable and Secure Computing, ISSN 1545-5971, E-ISSN 1941-0018. Article in journal (Refereed), Published
Abstract [en]

In this paper, we present a novel fault injection framework for system call invocation errors, called Phoebe. Phoebe is unique in three respects. First, Phoebe enables developers to have full observability of system call invocations. Second, Phoebe generates error models that are realistic in the sense that they mimic errors that naturally happen in production. Third, Phoebe is able to automatically conduct experiments to systematically assess the reliability of applications with respect to system call invocation errors in production. We evaluate the effectiveness and runtime overhead of Phoebe on two real-world applications in a production environment. The results show that Phoebe successfully generates realistic error models and is able to detect important reliability weaknesses with respect to system call invocation errors. To our knowledge, this novel concept of "realistic error injection", which consists of grounding fault injection on production errors, has never been studied before.
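One way to picture an error model that is grounded in naturally occurring errors is sketched below. It is not Phoebe's implementation: the observation format, the function name `synthesize_error_model`, and the idea of scaling each observed failure rate by an `amplification` factor are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# One observed invocation: (system call name, error code or None on success).
Observation = Tuple[str, Optional[str]]


def synthesize_error_model(observations: Iterable[Observation],
                           amplification: float = 10.0
                           ) -> Dict[Tuple[str, str], float]:
    """Map each naturally observed (syscall, error) pair to an injection probability.

    Only errors that were actually seen in the observations enter the model, so
    no experiment is spent on a failure that never occurs naturally. The
    amplification factor is an assumption of this sketch.
    """
    invocations: Counter = Counter()
    failures: Counter = Counter()
    for syscall, error in observations:
        invocations[syscall] += 1
        if error is not None:
            failures[(syscall, error)] += 1

    model: Dict[Tuple[str, str], float] = {}
    for (syscall, error), count in failures.items():
        natural_rate = count / invocations[syscall]
        # Cap at 1.0 so the result remains a valid probability.
        model[(syscall, error)] = min(1.0, natural_rate * amplification)
    return model
```

For example, if `read` fails with `EAGAIN` in 2 of 100 observed invocations, the sketch assigns that pair an injection probability of about 0.2, while a system call that never failed does not appear in the model at all.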

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
fault injection, system call, chaos engineering
National Category
Software Engineering
Identifiers
urn:nbn:se:kth:diva-275718 (URN); 10.1109/TDSC.2021.3069715 (DOI); 000822380300001 (); 2-s2.0-85103793181 (Scopus ID)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20210407

Available from: 2020-06-09 Created: 2020-06-09 Last updated: 2022-11-02. Bibliographically approved
2. A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM
2021 (English). In: IEEE Transactions on Software Engineering, ISSN 0098-5589, E-ISSN 1939-3520, Vol. 47, no 11, p. 2534-2548. Article in journal (Refereed), Published
Abstract [en]

Software systems contain resilience code to handle the failures and unexpected events that happen in production. It is essential for developers to understand and assess the resilience of their systems. Chaos engineering is a technique that aims at assessing resilience and uncovering weaknesses by actively injecting perturbations in production. In this paper, we propose a novel design and implementation of a chaos engineering system in Java called ChaosMachine. It provides a unique and actionable analysis of exception-handling capabilities in production, at the level of try-catch blocks. To evaluate our approach, we have deployed ChaosMachine on top of 3 large-scale and well-known Java applications totaling 630k lines of code. Our results show that ChaosMachine reveals both strengths and weaknesses of the resilience code of a software system at the level of exception handling.
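The idea of perturbing an application at the level of individual try-catch blocks can be illustrated with a small sketch. ChaosMachine itself targets Java and the JVM; the Python code below only mirrors the concept, and `PerturbationPoint`, `maybe_throw`, and `load_config` are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Type


class PerturbationPoint:
    """A switchable injection point placed at the start of one try block."""

    def __init__(self, name: str, exception_type: Type[Exception]) -> None:
        self.name = name
        self.exception_type = exception_type
        self.active = False      # toggled by the experiment controller
        self.injections = 0      # how many exceptions have been injected

    def maybe_throw(self) -> None:
        if self.active:
            self.injections += 1
            raise self.exception_type(f"injected at {self.name}")


def load_config(read_file: Callable[[], Dict[str, str]],
                point: PerturbationPoint) -> Dict[str, str]:
    """Application code whose exception handler is under assessment."""
    try:
        point.maybe_throw()      # perturbation fires before the guarded work
        return read_file()
    except OSError:
        # Resilience code: fall back to defaults instead of failing.
        return {"mode": "default"}
```

Activating the point forces the `except` branch to run under a realistic call context, which lets an experiment check whether the handler keeps the observable behavior acceptable.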

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
dynamic analysis, exception-handling, production systems, chaos engineering
National Category
Software Engineering
Identifiers
urn:nbn:se:kth:diva-283674 (URN); 10.1109/TSE.2019.2954871 (DOI); 000717767100014 (); 2-s2.0-85119594677 (Scopus ID)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230630

Available from: 2020-10-10 Created: 2020-10-10 Last updated: 2023-06-30. Bibliographically approved
3. TripleAgent: Monitoring, Perturbation and Failure-Obliviousness for Automated Resilience Improvement in Java Applications
2019 (English). In: Proceedings - International Symposium on Software Reliability Engineering, ISSRE, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 116-127. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a novel resilience improvement system for Java applications. The unique feature of this system is that it combines automated monitoring, automated perturbation injection, and automated resilience improvement. The latter is achieved thanks to failure-oblivious computing, a concept introduced in 2004 by Rinard and colleagues. We design and implement the system as agents for the Java virtual machine. We evaluate the system on two real-world applications: a file transfer client and an email server. Our results show that it is possible to automatically improve the resilience of Java applications with respect to uncaught or mishandled exceptions.
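In the spirit of failure-oblivious computing, an error can be absorbed close to where it occurs so that execution continues instead of the exception propagating up the call stack. The sketch below shows that idea as a Python decorator; it is an illustration only, not the TripleAgent implementation (which works as JVM agents), and the names `failure_oblivious` and `parse_port` are assumptions.

```python
import functools
from typing import Any, Callable, Tuple, Type


def failure_oblivious(default: Any,
                      handled: Tuple[Type[BaseException], ...] = (Exception,)
                      ) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Wrap a function so that handled exceptions yield `default` instead of propagating."""

    def decorate(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except handled:
                # Continue with a manufactured value rather than crashing.
                return default
        return wrapper

    return decorate


@failure_oblivious(default=0, handled=(ValueError,))
def parse_port(text: str) -> int:
    """Hypothetical example function: parse a TCP port from user input."""
    return int(text)
```

Restricting `handled` to specific exception types is a deliberate choice in this sketch: swallowing every exception could hide genuine defects, so the wrapper only absorbs the failures it was asked to absorb.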

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Series
Advances in Neural Information Processing Systems, ISSN 1049-5258
Keywords
Dynamic analysis, Exception handling, Fault injection, Software resilience, Automation, Software reliability, Automated monitoring, Design and implements, E-mail servers, Java applications, Java virtual machines, Unique features, Java programming language
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-274763 (URN); 10.1109/ISSRE.2019.00021 (DOI); 000542117600011 (); 2-s2.0-85081101109 (Scopus ID)
Conference
30th IEEE International Symposium on Software Reliability Engineering, ISSRE 2019, 28-31 October 2019, Berlin, Germany
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20200624

Part of ISBN 9781728149813

Available from: 2020-06-24 Created: 2020-06-24 Last updated: 2024-10-21. Bibliographically approved
4. Observability and Chaos Engineering on System Calls for Containerized Applications in Docker
2021 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 122, p. 117-129. Article in journal, Editorial material (Refereed), Published
Abstract [en]

In this paper, we present a novel fault injection system called ChaosOrca for system calls in containerized applications. ChaosOrca aims at evaluating a given application's self-protection capability with respect to system call errors. The unique feature of ChaosOrca is that it conducts experiments under production-like workload without instrumenting the application. We exhaustively analyze all kinds of system calls and utilize different levels of monitoring techniques to reason about the behaviour under perturbation. We evaluate ChaosOrca on three real-world applications: a file transfer client, a reverse proxy server and a microservice-oriented web application. Our results show that ChaosOrca is promising for detecting weaknesses in resilience mechanisms related to system call issues.

Place, publisher, year, edition, pages
Elsevier BV, 2021
Keywords
fault injection, chaos engineering, system call, containers, observability
National Category
Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-263699 (URN); 10.1016/j.future.2021.04.001 (DOI); 000652613800011 (); 2-s2.0-85104285116 (Scopus ID)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Foundation for Strategic Research, trustfull
Note

QC 20210614

Available from: 2019-11-08 Created: 2019-11-08 Last updated: 2024-09-04. Bibliographically approved
5. Automatic Observability for Dockerized Java Applications
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Docker is a virtualization technique heavily used in industry to build cloud-based systems. In this context, observability is a challenge: it is hard for engineers to get timely and accurate information about the running state in production, due to scale and virtualization. In this paper, we present a novel approach, called POBS, to automatically improve observability of Dockerized Java applications. POBS is based on automated transformations of Docker configuration files. Our approach injects additional modules in the production application, for providing better observability and for supporting fault injection. We evaluate POBS with open-source Java applications. Our key result is that 564/880 (64%) of Docker configuration files can be automatically augmented with better observability. This calls for more research on automated transformation techniques in the Docker ecosystem.
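An automated transformation of a Docker configuration file could, in a much simplified form, look like the sketch below. It is not the POBS transformation: the agent file name and path are hypothetical, and the strategy of inserting instructions right after the last `FROM` line is an assumption chosen to keep the example short. The only external facts it relies on are the standard Dockerfile instructions `FROM`, `COPY`, and `ENV`, and the `JAVA_TOOL_OPTIONS` environment variable that the JVM reads at startup.

```python
from typing import List

# Instructions added to the final build stage. The agent jar name and path
# are hypothetical placeholders, not artifacts of POBS.
AGENT_LINES: List[str] = [
    "# added by the sketch: hypothetical observability agent",
    "COPY observability-agent.jar /opt/agent/observability-agent.jar",
    'ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/agent/observability-agent.jar"',
]


def augment_dockerfile(text: str) -> str:
    """Insert the agent instructions after the last FROM line of a Dockerfile.

    Files without any FROM instruction are returned unchanged, which models
    the case where a configuration file cannot be augmented automatically.
    """
    lines = text.splitlines()
    from_indexes = [i for i, line in enumerate(lines)
                    if line.strip().upper().startswith("FROM ")]
    if not from_indexes:
        return text
    last_from = from_indexes[-1]
    augmented = lines[: last_from + 1] + AGENT_LINES + lines[last_from + 1:]
    return "\n".join(augmented) + "\n"
```

Targeting the last `FROM` line means that, in a multi-stage build, only the final runtime image is modified; the application's own `COPY` and `CMD` instructions are left untouched.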

Keywords
observability, fault injection, dynamic analysis, software resilience, Docker
National Category
Software Engineering
Identifiers
urn:nbn:se:kth:diva-275717 (URN)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20210113

Available from: 2020-06-09 Created: 2020-06-09 Last updated: 2022-11-02. Bibliographically approved
6. Chaos Engineering of Ethereum Blockchain Clients
(English). Manuscript (preprint) (Other (popular science, discussion, etc.))
Abstract [en]

The Ethereum blockchain is the operational backbone of major decentralized finance platforms. As such, it is expected to be exceptionally reliable. In this paper, we present ChaosETH, a chaos engineering tool for resilience assessment of Ethereum clients. ChaosETH operates in the following manner: First, it monitors Ethereum clients to determine their normal behavior. Then, it injects system call invocation errors into the Ethereum clients and observes the resulting behavior under perturbation. Finally, ChaosETH compares the behavior recorded before, during, and after perturbation to assess the impact of the injected system call invocation errors. The experiments are performed on the two most popular Ethereum client implementations: GoEthereum and OpenEthereum. We experiment with 22 different types of system call invocation errors. We assess their impact on the Ethereum clients with respect to 15 application-level metrics. Our results reveal a broad spectrum of resilience characteristics of Ethereum clients in the presence of system call invocation errors, ranging from direct crashes to full resilience. The experiments clearly demonstrate the feasibility of applying chaos engineering principles to blockchains.

Keywords
chaos engineering, Ethereum, fault injection, resilience benchmarking
National Category
Software Engineering
Identifiers
urn:nbn:se:kth:diva-309113 (URN); 10.48550/arXiv.2111.00221 (DOI)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20220222

Available from: 2022-02-21 Created: 2022-02-21 Last updated: 2022-11-02. Bibliographically approved

Open Access in DiVA

summary (1134 kB), 1077 downloads
File information
File name: SUMMARY01.pdf; File size: 1134 kB; Checksum (SHA-512):
09aa757c169a338840603554747ddb9ab04c0f70fd3d9480e7a4d6c8bd6964c13ce4cc4c0e8739dc4601c02a8f97acfaf7f70024a5f5945aeb9279a2877376fe
Type: summary; Mimetype: application/pdf

Authority records

Zhang, Long
