System virtualization allows for
the consolidation of many physical
servers on a single physical host by running the
workload of each physical server
inside a Virtual Machine (VM).
This is facilitated by a set of software components,
which we call the Virtualization Infrastructure (VI),
responsible for managing and multiplexing
physical resources among VMs.
While server consolidation using system virtualization
can greatly improve
the utilization of resources,
reliability becomes a major concern as
failure of the VI
due to hardware or software faults
can result in the failure of
all VMs running on the system.
The focus of this dissertation is on
the design and implementation of
mechanisms that enhance the
resiliency of the virtualized system
by making the VI resilient
to transient hardware and software faults.
Given that the use of hardware redundancy
can be costly, one of the main goals of this work
is to achieve high reliability
using purely software-based techniques.
The main approach used in this work
to provide resiliency to VI failures is
to partition the VI into subcomponents and
provide mechanisms to detect and recover
each failed VI component
transparently to the running VMs.
These resiliency mechanisms are developed
incrementally using results from fault injection
to identify dangerous state corruptions
and inconsistencies between the recovered
and existing components in the system.
A prototype containing the mechanisms proposed
in this dissertation is implemented
on top of the widely used Xen virtualized system.
In this prototype, a different
recovery mechanism is developed
for each of the three Xen VI components:
the virtual machine monitor (VMM),
driver VM (DVM), and
privileged VM (PrivVM).
With the proposed
resiliency mechanisms,
applications can continue to correctly provide
services in over 86% of detected VMM failures
and in over 96% of detected DVM and PrivVM failures.
The proposed mechanisms
require no modifications to applications running
in the VMs and
only minimal modifications to the VI.
These mechanisms are lightweight and
operate with minimal CPU and memory overhead
during normal system operation.
The mechanisms in this work do not
rely on redundant hardware but can
make use of redundant resources
to achieve, in many instances,
sub-millisecond recovery latency.
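
The component-level approach described above (partition the VI, detect a failed component, recover only that component) can be illustrated with a minimal sketch. This is not the Xen-based prototype: the `Component` and `Supervisor` classes, the `healthy` flag standing in for failure detection, and the `recover` method are all hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """A hypothetical VI subcomponent (e.g. VMM, DVM, PrivVM)."""
    name: str
    healthy: bool = True   # stand-in for a real failure detector
    recoveries: int = 0

    def recover(self) -> None:
        # Placeholder for a component-specific recovery routine.
        self.healthy = True
        self.recoveries += 1


@dataclass
class Supervisor:
    """Tracks components and recovers only those detected as failed."""
    components: Dict[str, Component] = field(default_factory=dict)

    def register(self, name: str) -> Component:
        comp = Component(name)
        self.components[name] = comp
        return comp

    def failed(self) -> List[Component]:
        # Detection step: collect components currently marked unhealthy.
        return [c for c in self.components.values() if not c.healthy]

    def step(self) -> List[str]:
        # Recovery step: healthy components are left untouched.
        recovered = []
        for comp in self.failed():
            comp.recover()
            recovered.append(comp.name)
        return recovered


def demo() -> List[str]:
    sup = Supervisor()
    for name in ("VMM", "DVM", "PrivVM"):
        sup.register(name)
    sup.components["DVM"].healthy = False  # simulate a detected fault
    return sup.step()
```

In this sketch a fault in one component triggers recovery of that component alone, which mirrors the goal of keeping recovery transparent to the rest of the system; the detection and recovery logic of a real VI is, of course, far more involved.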