Scholarly Works (5 results)

Thesis
Peer Reviewed

Practical Dependable Systems with OS/Hypervisor Support

Zhou, Diyu
Advisor(s): Tamir, Yuval

UCLA Electronic Theses and Dissertations (2020)

Critical applications require dependability mechanisms to protect them from failures due to faults. Dependable systems for mainstream deployment are typically built upon commodity hardware with mechanisms that enhance resilience implemented in software. Such systems are aimed at providing commercially viable, best-effort dependability cost-effectively.

This thesis proposes several practical, low-overhead dependability mechanisms for critical components in the system: hypervisors, containers, and parallel applications.

For hypervisors, the latency to reboot a new instance to recover from transient faults is unacceptably high. NiLiHype recovers the hypervisor by resetting it to a quiescent state that is highly likely to be valid. Compared to prior work based on reboot, NiLiHype reduces the service interruption time during recovery from 713ms to 22ms, a factor of over 30x, while achieving nearly the same recovery success rate.

NiLiCon, to the best of our knowledge, is the first replication mechanism for commercial off-the-shelf containers. NiLiCon is based on high-frequency incremental checkpointing to a warm spare, previously used for VMs. A key implementation challenge is that, compared to a VM, there is a much tighter coupling between the container state and the state of the underlying platform. NiLiCon meets this challenge with various enhancements and achieves performance that is competitive with VM replication.

HyCoR enhances NiLiCon with deterministic replay to address a fundamental drawback of high-frequency replication techniques: an unacceptably long delay of outputs to clients. With deterministic replay, HyCoR decouples latency overhead from the checkpointing interval. For a set of eight benchmarks, with HyCoR, the latency overhead is reduced from tens of milliseconds to less than 600us. For data race-free applications, the throughput overhead of HyCoR is only 2%-58%.

PUSh is a dynamic data race detector based on detecting violations of the intended sharing of objects, specified by the programmer. PUSh leverages existing memory protection hardware to detect such violations. Specifically, a key optimization in PUSh exploits memory protection keys, a hardware feature recently added to the x86 ISA. Several other key optimizations are achieved by enhancing the Linux kernel. For a set of eleven benchmarks, PUSh's memory overhead is less than 5.8% and performance overhead is less than 54%.

Thesis
Peer Reviewed

Resilient Virtualized Systems

Le, Michael Vu
Advisor(s): Tamir, Yuval

UCLA Electronic Theses and Dissertations (2014)

System virtualization allows for the consolidation of many physical servers on a single physical host by running the workload of each physical server inside a Virtual Machine (VM). This is facilitated by a set of software components, that we call the Virtualization Infrastructure (VI), responsible for managing and multiplexing physical resources among VMs. While server consolidation using system virtualization can greatly improve the utilization of resources, reliability becomes a major concern as failure of the VI due to hardware or software faults can result in the failure of all VMs running on the system.

The focus of this dissertation is on the design and implementation of mechanisms that enhance the resiliency of the virtualized system by way of enhancing the resiliency of the VI to transient hardware and software faults. Given that the use of hardware redundancy can be costly, one of the main goals of this work is to achieve high reliability using purely software-based techniques.

The main approach for providing resiliency to VI failures used in this work is to partition the VI into subcomponents and provide mechanisms to detect and recover each failed VI component transparently to the running VMs. These resiliency mechanisms are developed incrementally using results from fault injection to identify dangerous state corruptions and inconsistencies between the recovered and existing components in the system.

A prototype containing mechanisms proposed in this dissertation is implemented on top of the widely-used Xen virtualized system. In this prototype, three different recovery mechanisms are developed for each Xen VI component: the virtual machine monitor (VMM), driver VM (DVM), and privileged VM (PrivVM). With the proposed resiliency mechanisms, applications can continue to correctly provide services over 86% of detected VMM failures and over 96% of detected DVM and PrivVM failures.

The proposed mechanisms require no modifications to applications running in the VMs and minimal amount of modifications to the VI. These mechanisms are light-weight and can operate with minimal CPU and memory overhead during normal system operations. The mechanisms in this work do not rely on redundant hardware but can make use of redundant resources to achieve, in many instances, sub-millisecond recovery latency.

Thesis
Peer Reviewed

Hypervisor Side Cache for Virtual Desktop Infrastructure

Sakdeo, Sumedh Vivek
Advisor(s): Tamir, Yuval

UCLA Electronic Theses and Dissertations (2012)

Virtual Desktop Infrastructure (VDI) deployments run large numbers of desktops in a virtualized environment to increase flexibility and address cost. One of the major challenges VDI faces today is the cost of high bandwidth interconnection networks to shared storage. VDI storage workloads have a number of unique characteristics which make them a target for optimization. For example, VDI workloads exhibit a high volume of redundant data transfers (from shared OS images), highly bursty behavior (from daily work patterns), and a common storage format (virtual disks).

This thesis performs a detailed study of VDI workloads and evaluates the effectiveness of four hypervisor-side optimization techniques. To eliminate network read requests and serve data from locally cached blocks, we evaluate two read caches: location-addressed and content-addressed. We also compare these read caches with a simple mechanism that stores shared read-only virtual disks on hypervisor-side local media. To eliminate the transfer of redundant data written to the storage server, we evaluate the effectiveness of inline write deduplication. All experiments are carried out in two settings: full clone virtual desktops and linked clone virtual desktops.

A detailed trace-driven simulation study of the mechanisms with a realistic VDI workload shows up to a 75% reduction in total network I/O traffic. We recommend settings for choosing the right optimizations; for example, for full clone virtual desktops, the content-addressed cache outperforms the location-addressed cache by 50%.
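The difference between the two read caches named in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the class names, the `fetch` callback, the precomputed digest table, and the two-desktop workload are all assumptions made for illustration. A location-addressed cache keys blocks by where they live, while a content-addressed cache keys them by a digest of their contents, so identical blocks on different virtual disks share one entry.

```python
import hashlib


class LocationAddressedCache:
    """Read cache keyed by where a block lives: (disk_id, block_no)."""

    def __init__(self):
        self.blocks = {}

    def read(self, disk_id, block_no, fetch):
        key = (disk_id, block_no)
        if key not in self.blocks:
            # Cache miss: read the block from shared storage over the network.
            self.blocks[key] = fetch(disk_id, block_no)
        return self.blocks[key]


class ContentAddressedCache:
    """Read cache keyed by a digest of the block's contents.

    Assumes a precomputed table mapping each block location to its digest
    (hypothetical; how digests are obtained is not stated in the abstract).
    """

    def __init__(self, digests):
        self.digests = digests
        self.blocks = {}

    def read(self, disk_id, block_no, fetch):
        key = self.digests[(disk_id, block_no)]
        if key not in self.blocks:
            self.blocks[key] = fetch(disk_id, block_no)
        return self.blocks[key]


def network_reads(cache, disks):
    """Read every block of every disk once; return the number of cache misses."""
    misses = 0

    def fetch(disk_id, block_no):
        nonlocal misses
        misses += 1
        return disks[disk_id][block_no]

    for disk_id, blocks in disks.items():
        for block_no in range(len(blocks)):
            assert cache.read(disk_id, block_no, fetch) == blocks[block_no]
    return misses


# Two hypothetical full-clone desktops created from the same OS image.
os_image = [b"boot", b"kernel", b"libs"]
disks = {"vm1": list(os_image), "vm2": list(os_image)}
digests = {
    (disk_id, block_no): hashlib.sha256(block).hexdigest()
    for disk_id, blocks in disks.items()
    for block_no, block in enumerate(blocks)
}

location_misses = network_reads(LocationAddressedCache(), disks)
content_misses = network_reads(ContentAddressedCache(digests), disks)
print(location_misses, content_misses)  # prints "6 3"
```

In this toy workload the location-addressed cache fetches each clone's copy separately, while the content-addressed cache fetches each distinct block once. The numbers illustrate the mechanism only and say nothing about the traffic reductions measured in the thesis.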

Thesis
Peer Reviewed

Design and Validation of a Layered Approach to Fault Tolerance for Distributed Applications

Hsu, Israel Yi-Hsin
Advisor(s): Tamir, Yuval

UCLA Electronic Theses and Dissertations (2014)

Clusters of message-passing computing nodes provide high-performance platforms for distributed applications. Cost-effective implementations of such systems are based on commercial off-the-shelf (COTS) hardware and software components. One trend in the deployment of such systems is to scale up the number of compute nodes to deliver higher performance levels. The higher component count results in a corresponding higher rate of failure. Another trend is to deploy clusters for mission-critical applications or in harsh environments, where reliability requirements are higher than in a controlled lab setting. Both of these trends point to an increasing need to employ fault tolerance techniques to meet the reliability requirements of the applications being executed.

We present a layered approach to providing fault tolerance for message-passing applications on compute clusters that are based on COTS hardware components, COTS operating systems, and a COTS API for application programmers. This approach relies on highly-resilient cluster management middleware (CMM) that ensures the survival of key system services despite the failure of cluster components. A key feature of this CMM is that it provides services that enable and simplify user-level implementation of fault tolerance for applications without dictating the specific techniques employed. In particular, while application-transparent techniques are supported, the CMM also supports application-specific techniques that are tailored and optimized for the characteristics and requirements of specific applications. To this end, we have developed an API that can be used in the implementation of fault tolerance by the application programmer as well as by developers of user-level libraries that provide application-transparent fault tolerance.

The effectiveness of our layered approach is demonstrated and evaluated with several applications employing different techniques for fault tolerance. The entire system is subjected to a fault injection campaign. We show that the CMM services that support fault tolerance techniques operate reliably and with very low overhead. We also show that application-specific fault tolerance techniques detect and recover from a vast majority of manifested faults while imposing much lower performance overhead than application-transparent schemes.

Article
Peer Reviewed

Predicting the risk of iliofemoral vascular complication in complex transfemoral-TAVR using new generation transcatheter devices.

UCLA Previously Published Works (2023)

OBJECTIVE: Design a predictive risk model for minimizing iliofemoral vascular complications (IVC) in a contemporary era of transfemoral-transcatheter aortic valve replacement (TF-TAVR).

BACKGROUND: IVC remains a common complication of TF-TAVR despite the technological improvement in the new-generation transcatheter systems (NGTS) and enclosed poor outcomes and quality of life. Currently, there is no accepted tool to assess the IVC risk for calcified and tortuous vessels.

METHODS: We reconstructed CT images of 516 propensity-matched TF-TAVR patients using the NGTS to design a predictive anatomical model for IVC and validated it on a new cohort of 609 patients. Age, sex, peripheral artery disease, valve size, and type were used to balance the matched cohort.

RESULTS: IVC occurred in 214 (7.2%) patients. Sheath size (p = 0.02), the sum of angles (SOA) (p < .0001), number of curves (NOC) (p < .0001), minimal lumen diameter (MLD) (p < .001), and sheath-to-femoral artery diameter ratio (SFAR) (p = 0.012) were significant predictors for IVC. An indexed risk score (CSI) consisting of multiplying the SOA and NOC divided by the MLD showed 84.3% sensitivity and 96.8% specificity, when set to >100, in predicting IVC (C-stat 0.936, 95% CI 0.911-0.959, p < 0.001). Adding SFAR > 1.00 in a tree model increased the overall accuracy to 97.7%. In the validation cohort, the model predicted 89.5% of the IVC cases with an overall 89.5% sensitivity, 98.9% specificity, and 94.2% accuracy (C-stat 0.842, 95% CI 0.904-0.980, p < .0001).

CONCLUSION: Our CT-based validated-model is the most accurate and easy-to-use tool assessing IVC risk and should be used for calcified and tortuous vessels in preprocedural planning.
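The indexed score described in the results can be written out as arithmetic. The sketch below only restates the formula given in the abstract, reading it as CSI = (SOA × NOC) / MLD with a cutoff of > 100. The function names and the example measurements are hypothetical, and the abstract does not state the units of SOA or MLD, so the inputs are assumed to be in whatever units the authors used. The abstract does not specify how SFAR enters the tree model, so that step is not sketched.

```python
def csi(sum_of_angles, number_of_curves, minimal_lumen_diameter):
    """Indexed score as read from the abstract: (SOA * NOC) / MLD."""
    if minimal_lumen_diameter <= 0:
        raise ValueError("minimal lumen diameter must be positive")
    return sum_of_angles * number_of_curves / minimal_lumen_diameter


def exceeds_csi_cutoff(sum_of_angles, number_of_curves, minimal_lumen_diameter,
                       cutoff=100.0):
    """True when the score is above the cutoff quoted in the abstract (> 100)."""
    return csi(sum_of_angles, number_of_curves, minimal_lumen_diameter) > cutoff


# Hypothetical measurements, chosen only to show the arithmetic.
print(csi(200, 3, 5))                # 200 * 3 / 5 = 120.0
print(exceeds_csi_cutoff(200, 3, 5))  # True
print(exceeds_csi_cutoff(90, 2, 6))   # 90 * 2 / 6 = 30.0, so False
```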
