Misconfigurations (configuration errors, from a system’s standpoint) are among the dominant causes of today’s catastrophic system failures, which take down cloud-scale services and affect hundreds of millions of end users. Despite their wide adoption, traditional fault-tolerance and failure-recovery techniques are ineffective against configuration errors, especially in large-scale software systems deployed in clouds and datacenters. To make matters worse, the tolerance and recovery mechanisms themselves are often misconfigured in practice, which impairs the immune system of the entire cloud or datacenter.
This dissertation explores two fundamental questions in addressing inevitable misconfigurations: how to build reliable cloud and datacenter systems in the face of configuration errors, and how to prevent misconfigurations in the first place through better configuration design. The goal is to enable software systems to proactively anticipate and defend against misconfigurations, rather than react to their manifestations and consequences.
This dissertation presents three key principles of systems design and implementation for hardening cloud and datacenter systems against misconfigurations: anticipation of misconfigurations, early detection of configuration errors, and simplicity-oriented configuration design. The dissertation demonstrates that applying these principles can effectively defend cloud and datacenter systems against misconfigurations. Moreover, it presents the corresponding techniques and tool support that automatically and systematically apply these principles to existing systems software.
The main technical insight is that configurations are ultimately consumed by the system’s code, and configuration errors mostly manifest through faulty execution that uses erroneous configuration values. Therefore, by analyzing the code that uses configuration values, one can extract system-level information about configurations and use it to build defenses against potential errors. This dissertation first presents Spex, which enables systems to anticipate misconfigurations. Spex automatically infers configuration constraints from a system’s source code, and then leverages the constraints to test the system’s resilience to misconfigurations and to detect error-prone configuration design and handling. One step further, the dissertation introduces PCheck, which automatically generates checking code that captures configuration errors at the system’s initialization phase, preventing their late manifestation and the resulting failure damage.
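As a rough illustration of the early-detection principle, the sketch below checks configuration values against explicit constraints at startup, before any late code path consumes them. It is not the Spex or PCheck implementation: the constraints here are written by hand rather than inferred from source code, and the parameter names (`listen_port`, `backup_dir`) and helpers (`Constraint`, `check_at_startup`) are hypothetical.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Constraint:
    """A hand-written constraint on one configuration parameter (hypothetical)."""
    name: str                      # configuration parameter name
    description: str               # human-readable expectation, used in error messages
    check: Callable[[str], bool]   # returns True when the value satisfies the constraint


def is_port(value: str) -> bool:
    """Value must parse as an integer in the valid TCP/UDP port range."""
    try:
        return 1 <= int(value) <= 65535
    except ValueError:
        return False


def is_writable_dir(value: str) -> bool:
    """Value must name an existing directory the process can write to."""
    return os.path.isdir(value) and os.access(value, os.W_OK)


# Assumed constraints for two illustrative parameters.
CONSTRAINTS = [
    Constraint("listen_port", "an integer in [1, 65535]", is_port),
    Constraint("backup_dir", "an existing, writable directory", is_writable_dir),
]


def check_at_startup(config: Dict[str, str]) -> List[str]:
    """Return one message per violated constraint; an empty list means no violation."""
    errors = []
    for c in CONSTRAINTS:
        value = config.get(c.name)
        if value is None:
            errors.append(f"{c.name}: missing (expected {c.description})")
        elif not c.check(value):
            errors.append(f"{c.name}={value!r}: expected {c.description}")
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bad = {"listen_port": "80800", "backup_dir": os.path.join(tmp, "missing")}
        # Both errors surface at initialization, not when a backup first runs.
        for message in check_at_startup(bad):
            print("config error:", message)
```

The same constraint list could also drive resilience testing, by deliberately generating values that violate each constraint and observing how the system reacts.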
Going further, this dissertation presents simplicity-oriented configuration design, aimed at more usable and less error-prone software configuration. The key idea is to apply a user-centric philosophy and design configuration as an interface: configurations are essentially the interface for controlling and customizing system behavior, but have rarely been treated as such. The dissertation shows that, with an understanding of how configurations are actually used in the field, the configurations in today’s systems software can be significantly simplified and effectively navigated.