A number of competing concerns slow adoption of deep learning for computer vision on “edge” devices. Edge devices provide only limited resources for on-device algorithms to
employ, constraining power, memory, and storage usage. Examples include mobile phones,
autonomous vehicles, and virtual reality headsets, which demand both high accuracy and low
latency, two objectives competing for resources.
To tackle this Sisyphean task, modern methods expend gargantuan amounts of computation, often exceeding thousands of GPU hours or even years of GPU compute to design a
single neural network. Worse still, these works maximize just one performance metric –
accuracy – under a single set of resource constraints. What if the set of resource constraints
changes? What if additional performance metrics, such as explainability or generalization, rise
to the forefront? Modern methods for designing efficient neural networks are handicapped by
excessive computational requirements in pursuit of goals that are too narrow and too singular.
This thesis tackles the bottlenecks of modern methods directly, achieving state-of-the-art performance by efficiently designing efficient deep neural networks. These improvements
do not merely reduce computation or merely improve accuracy; instead, our methods improve
performance and reduce computational requirements, despite increasing the search space size by
orders of magnitude. We also demonstrate missed opportunities in performance metrics
beyond accuracy, redesigning the task so that accuracy, explainability, and generalization
improve jointly, defying the conventional wisdom that explainability and
accuracy participate in a zero-sum game.
This thesis culminates in a set of models that set new flexibility and performance standards for production-ready models: state-of-the-art in accuracy, explainable, generalizable,
and configurable for any set of resource constraints in just CPU minutes.