Toward Online Testing of Federated and Heterogeneous Distributed Systems
2011 (English). In: Proceedings of The 2011 USENIX Annual Technical Conference, 2011. Conference paper (Refereed)
Making distributed systems reliable is notoriously difficult. It is even more difficult to achieve high reliability for federated and heterogeneous systems, i.e., those that are operated by multiple administrative entities and have numerous inter-operable implementations. A prime example of such a system is the Internet’s inter-domain routing, today based on BGP.
We argue that system reliability should be improved by proactively identifying potential faults using an online testing functionality. We propose DiCE, an approach that continuously and automatically explores the system behavior to check whether the system deviates from its desired behavior. DiCE orchestrates the exploration of relevant system behaviors by subjecting system nodes to many possible inputs that exercise node actions. DiCE starts exploring from current, live system state, and operates in isolation from the deployed system. We describe our experience in integrating DiCE with an open-source BGP router. We evaluate the prototype’s ability to quickly detect origin misconfiguration, a recurring operator mistake that causes Internet-wide outages. We also quantify DiCE’s overhead and find it to have marginal impact on system performance.
Identifiers: URN: urn:nbn:se:kth:diva-147100; OAI: oai:DiVA.org:kth-147100; DiVA: diva2:727670
USENIX Annual Technical Conference, June 15-17, 2011, Portland, OR, USA
Qc 20140704; 2014-06-23; 2014-06-23; 2014-07-04; Bibliographically approved