eScholarship
Open Access Publications from the University of California

UC Berkeley Electronic Theses and Dissertations

Data-Driven Synthesis Science: Text Mining the Literature to Rationalize Inorganic Synthesis Pathways and Outcomes

(2024)

The rate of discovery of novel materials has accelerated in recent decades, and the slow step in realizing these materials has long been synthesis design. Improving our understanding of the sequence of phases formed during reaction, from precursors to target, and how that sequence is affected by the synthesis conditions would give researchers a means to predict the conditions required to reach new desired outcomes. Because of the vast dimensionality of the synthesis design space, such a prediction task is well suited to data-driven methods: combining machine learning with real experiments and computational modeling to both generate and test hypotheses that rationalize synthesis pathways. Driving hypothesis generation from data requires a substantial source of historical syntheses; in this thesis, we leverage the availability of synthesis procedures and characterized outcomes in the scientific literature. Text mining of the scientific literature, using natural language processing as well as manual methods, has been employed extensively in materials science over the past decade, with applications ranging from systematic literature review to generating datasets of material properties to studies in synthesis science. Yet within text mining for synthesis science, little attention has been paid to distinguishing syntheses by their outcome (e.g., final phase purity, particle morphology). The availability of such experimental outcome information, and its subsequent analysis, would give researchers useful data for developing models that hypothesize the effects of specific synthesis features on desired outcomes. The available synthesis literature offers an ample source of such recipe-outcome-paired data.

Progress in automatic text mining of the scientific literature is steady, particularly with the recent rise of generative large language models. However, pitfalls remain in reliably extracting materials synthesis recipes and linking them to outcomes, making manual curation of such datasets more attractive in some cases. This thesis highlights both (1) advances in automatic methods for acquiring inorganic synthesis procedures and outcomes from the literature and (2) data-driven insights into synthesis science gleaned from synthesis datasets extracted manually from the literature in combination with direct experiment and first-principles computation. For (1), we developed robust named entity recognition models for the extraction of synthesis procedure graphs as well as morphological outcomes for nanoparticle synthesis. To demonstrate these methods, we constructed a large-scale text-mined dataset of gold nanoparticle synthesis recipes, which are plentiful in the literature and thus represent a rich source for data-driven synthesis design. Importantly, we also extract their morphological outcomes; the inclusion of both input synthesis conditions and the corresponding outputs makes this dataset valuable for data-driven synthesis science efforts. For (2), we focus on the extraction of phase purity outcomes for oxide materials. Impurity phases, in the form of remnant precursors or intermediate phases, offer clues to the synthesis pathway traversed in a reaction. BiFeO3 is an important multiferroic material that is frequently synthesized in the literature and has a strong tendency to form alongside competing impurity phases. We therefore pivot our focus in chemistry space to BiFeO3 for this second task. For this system, we demonstrate how text-mined datasets of recipes and outcomes can inform real experiments and computational modeling that rationalize synthesis pathways.
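As a rough illustration of the kind of extraction pipeline that task (1) involves, the sketch below runs a token-classification (NER) model over a synthesis sentence and groups the predicted entities into a recipe-like record. It is a minimal sketch assuming a fine-tuned model exists; the model name and entity labels are hypothetical, not the thesis's actual model.

```python
# Minimal NER-based recipe-extraction sketch (illustrative only).
# "my-org/synthesis-ner" is a hypothetical fine-tuned model name.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="my-org/synthesis-ner",    # hypothetical model
    aggregation_strategy="simple",   # merge word pieces into entity spans
)

sentence = ("HAuCl4 was reduced with sodium citrate at 100 C "
            "to yield gold nanospheres of ~15 nm diameter.")

# Group predicted spans by entity type (e.g., PRECURSOR, CONDITION, MORPHOLOGY).
recipe = {}
for ent in ner(sentence):
    recipe.setdefault(ent["entity_group"], []).append(ent["word"])
print(recipe)
```

Records like this, paired across thousands of papers, form the kind of recipe-outcome dataset described above.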

In this thesis, we endeavor to improve the role of text mining, both automated and manual, in data-driven synthesis science. Our focus on extracting synthesis outcomes in addition to their corresponding procedures helps advance this subfield of materials science and paves the way for future efforts to accelerate both our understanding of synthesis mechanisms for existing materials and the discovery of new compounds.

Object-Centric Perception for Real-World Robotics

(2024)

Deep learning has resulted in incredible progress in many applications of artificial intelligence. However, these techniques often fall short when applied to robotics, due to their inability to reason about the ambiguity that often arises in the real world. Much of this ambiguity stems from the real world's long-tail visual diversity, in particular the huge variety of objects that robots must interact with. Such shortcomings are only exacerbated by the strict requirements for autonomous, high-throughput operation that deployed systems must meet, as well as the cost and difficulty of obtaining the large-scale training datasets that modern deep learning methods require.

In this thesis, we explore two primary avenues of addressing these challenges. First, we introduce models that can better express uncertainty in challenging or ambiguous situations, across a variety of 2D and 3D perception tasks. Real-world robots can incorporate these models to reason explicitly about ambiguity, in flexible ways depending on their specific tasks. Second, we extend the capabilities of neural renderers to develop a sim2real2sim method that can drastically reduce the amount of data needed to train such models. From only a handful of in-the-wild examples, our method learns to generate synthetic scenes, targeted to specific real objects and environments, that can be used to train downstream perception models for a variety of tasks.
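One common way to let a perception model "express uncertainty" is to train a small ensemble and treat prediction disagreement as an uncertainty signal; the sketch below shows that generic scheme for illustration, not necessarily the method used in this thesis.

```python
# Illustrative ensemble-disagreement uncertainty (not the thesis's method).
import numpy as np

def ensemble_uncertainty(models, x):
    """Mean prediction and per-class spread across an ensemble.

    `models` is any list of callables returning class-probability vectors.
    A large spread flags ambiguous inputs that a robot may want to treat
    cautiously (e.g., slow down, ask for help, or gather another view).
    """
    probs = np.stack([m(x) for m in models])   # (n_models, n_classes)
    return probs.mean(axis=0), probs.std(axis=0)
```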

Application of Engineered Cell Culture Models to Study Glioma Stem Cell Motility

(2024)

In the past two decades, glioma stem cells (GSCs) have been increasingly implicated in driving tumor initiation, resistance, recurrence, growth, and invasion in glioblastoma (GBM). Despite in vivo evidence identifying GSCs at invasive regions of GBM tissue, GSC invasion, and by extension motility, has been less well appreciated. Furthermore, GBM tumor invasion is informed by a multitude of biochemical factors, such as cytokines and growth factors, and biophysical factors, such as matrix and stroma geometry and mechanics. GBM tumors invade slowly through the hyaluronic acid (HA)-rich parenchyma and then rapidly along microvascular tracks of varying geometries. How biophysical and biochemical factors together contribute to driving invasion of GBM tumor cells such as GSCs within these regions is not well understood. Progress in understanding this combinatorial effect is limited by a lack of physiologically representative cell culture models that enable systematic investigation of the regulators of invasion.

In this dissertation, we first applied an HA-based hydrogel model of the brain parenchyma to study transforming growth factor beta (TGF-β)-induced invasion of GSCs. We demonstrate that, in response to TGF-β, GSCs differ in their ability to invade HA in a way that can be predicted from TGF-β receptor 2 expression and SMAD2 phosphorylation. Additionally, we found an association between TGF-β responsiveness and GSC subtype. Interestingly, TGF-β-stimulated GSC invasion exhibited a strong dependence on the presence of RGD peptides. Next, we deployed protein micropattern lines with vessel-like geometries to understand the emergent migration behavior of GSCs in a confined environment. We tested multiple GSC lines and found that vessel-like geometries enhanced both migration speed and persistence in GSCs. However, no individual GSC line demonstrated both enhanced migration speed and persistence, suggesting that vessel-like geometric confinement differentially influences the migration dynamics of GSCs.

Designing Machine Learning-Enhanced Tools and Physics-based Techniques for Force Field and Electrostatic Models

(2024)

In recent years, the landscape of molecular science has been profoundly transformed by the integration of data-driven methodologies alongside traditional deterministic and stochastic approaches. Historically, the study of molecular behavior and interactions relied heavily on deterministic algorithms, which follow a fixed sequence of computational steps to simulate molecular dynamics, and stochastic simulations, which incorporate randomness to explore various molecular states and pathways. These methods were complemented by physical models grounded in the established principles of chemistry and physics, forming the backbone of theoretical molecular science. However, these conventional approaches often faced limitations in scalability, computational cost, and generalizability for complex systems. The improvements in computational hardware, coupled with the accumulation of vast amounts of molecular data, have enabled the development of models that can surpass traditional methods in both accuracy and efficiency, leveraging both physics-based and machine learning (ML) approaches. This dissertation focuses on the development of new models utilizing more accessible data, provides guidelines for computational data generation, and explores the synergy between data acquisition strategies and data-driven models. These studies demonstrate that by carefully designing data acquisition strategies and integrating data-driven models with physics-based approaches, it is possible to enhance the predictive capabilities of computational methods in chemistry, particularly in force field development and electrostatic modeling. Through a series of studies, this work illustrates the potential of combining the strengths of both traditional and modern computational techniques to achieve more accurate and efficient predictions in molecular science.

The accurate prediction of electrostatic interactions is a critical aspect of understanding molecular behavior. The electrostatic potential (ESP) is a property of great research interest for understanding and predicting the electrostatic charge distributions that drive interactions between molecules. However, traditional approaches often rely on detailed quantum mechanical calculations, which can be computationally expensive. In Chapter 2, I introduce a coarse-grained electron model (C-GeM), whose parameters are fitted to high-quality, computationally generated density functional theory (DFT) data, and which offers a balance between accuracy and computational efficiency. Extensive validation against high-level quantum mechanical calculations demonstrates that C-GeM can reliably predict electrostatic potentials and interaction energies in proteins. The model's implementation in large-scale molecular simulations shows significant reductions in computational cost, making it a viable tool for studying complex biological systems.
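The quantity such a model approximates can be stated compactly: for effective charges q_i at positions r_i, the electrostatic potential at a probe point r is V(r) = Σ_i q_i / |r − r_i| (atomic units). The sketch below evaluates this toy Coulomb sum; it illustrates the target quantity only and is not the actual C-GeM implementation.

```python
# Electrostatic potential of point charges (atomic units); a toy Coulomb
# sum illustrating the quantity a model like C-GeM approximates.
import numpy as np

def esp(probe, positions, charges):
    """V(r) = sum_i q_i / |r - r_i| at a single probe point."""
    dist = np.linalg.norm(positions - probe, axis=1)
    return np.sum(charges / dist)

# Toy example: a +1/-1 pair probed midway between them -> 0 by symmetry.
pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
q = np.array([1.0, -1.0])
print(esp(np.array([1.0, 0.0, 0.0]), pos, q))
```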

The generation of reference data for deep learning models poses significant challenges for reactive systems, especially for combustion reactions, due to the extreme conditions that produce radical species and alternative spin states. In Chapter 3, intrinsic reaction coordinate (IRC) calculations are extended with ab initio molecular dynamics (MD) simulations and normal mode displacement calculations to comprehensively map the potential energy surface (PES) for 19 reaction channels involved in hydrogen combustion. This extensive dataset comprises approximately 290,000 potential energies and 1,260,000 nuclear force vectors, evaluated using a high-quality range-separated hybrid density functional, ωB97X-V. The dataset includes detailed information on transition state configurations as well as geometries along the reactive pathway that links reactants to products, providing a robust reference for training deep learning models aimed at studying hydrogen combustion reactions. This benchmark dataset not only serves as a valuable resource for understanding the intricate mechanistic pathways of hydrogen combustion but also provides a paradigm for building datasets that facilitate the development and validation of machine learning models for reactive chemistry.
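As a sketch of one of the sampling strategies named above, normal-mode displacement generates off-minimum geometries by perturbing an equilibrium structure along its vibrational eigenvectors; each displaced geometry is then evaluated with DFT for energies and forces. The sketch assumes the modes are precomputed, and the amplitudes and fake modes are toy values.

```python
# Normal-mode displacement sampling sketch: perturb an equilibrium geometry
# along (precomputed) vibrational modes to sample the PES near a stationary
# point. Amplitudes and the fake modes below are illustrative only.
import numpy as np

def displace(eq_geom, modes, rng, max_amp=0.3):
    """eq_geom: (n_atoms, 3); modes: (n_modes, n_atoms, 3)."""
    amps = rng.uniform(-max_amp, max_amp, size=len(modes))
    return eq_geom + np.tensordot(amps, modes, axes=1)

rng = np.random.default_rng(0)
eq = np.zeros((3, 3))                  # toy 3-atom geometry
modes = rng.normal(size=(2, 3, 3))     # 2 fake, unnormalized modes
sample = displace(eq, modes, rng)      # one off-minimum geometry for DFT
```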

Building on the extensive benchmark dataset for hydrogen combustion detailed in Chapter 3, an initial machine learning model is trained to predict energies and forces for the hydrogen combustion reactive system using NewtonNet, a physics-inspired equivariant message-passing neural network (MPNN). This reactive gas-phase chemistry network is particularly challenging due to the need for comprehensive potential energy surfaces that accurately represent a wide range of molecular configurations. Traditional approaches often rely on chemical intuition to select training data, which can result in incomplete PESs in an ML setting. To address this challenge, I employ an active learning strategy that systematically explores diverse energy landscapes using metadynamics simulations and continuously adds unseen data for retraining, helping to create an ML model that avoids unforeseen high-energy or unphysical configurations. By integrating metadynamics, the active learning process converges the PES more rapidly and also allows a hybrid of ML and ab initio molecular dynamics (MD) that makes rare calls to external ab initio sources when discrepancies are detected by the query-by-committee models. This hybrid ML-physics approach reduces computational costs by two orders of magnitude and eliminates the need for excessive ML retraining. The enhanced model accurately predicts free energy changes and transition state mechanisms for several hydrogen combustion reaction channels, demonstrating the efficacy of combining advanced data acquisition strategies with robust ML techniques to achieve high precision and efficiency in molecular simulations.
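The hybrid loop described here can be summarized in a few lines: a committee of models predicts forces at every MD step, and only when the committee disagrees with itself beyond a threshold is the expensive ab initio call made and the configuration queued for retraining. The sketch below is schematic; the names and threshold are illustrative, not taken from the thesis.

```python
# Schematic query-by-committee force evaluation for hybrid ML/ab initio MD.
import numpy as np

def committee_forces(geom, committee, ab_initio, train_set, tol=0.05):
    """Forces for one MD step, falling back to ab initio when the ML
    committee disagrees with itself by more than `tol` (illustrative units)."""
    preds = np.stack([m(geom) for m in committee])  # (n_models, n_atoms, 3)
    if preds.std(axis=0).max() > tol:     # committee is unsure here
        forces = ab_initio(geom)          # rare, expensive ab initio call
        train_set.append((geom, forces))  # queue the point for retraining
        return forces
    return preds.mean(axis=0)             # cheap ML forces
```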

To summarize, this dissertation underscores the potential of combining data-driven models with physics-based approaches to overcome the limitations of traditional computational methods in molecular science. Through the development of the coarse-grained electron model (C-GeM), the creation of a comprehensive benchmark dataset for hydrogen combustion, and the implementation of an active learning workflow for reactive force field development, this work provides insights into developing new computational tools and leveraging them to better understand molecular interactions and reactivity.

Empowering Large Language Models with Efficient and Automated Systems

(2024)

Large Language Models (LLMs) have shown remarkable capabilities in a variety of tasks, including chatting, programming, and searching. However, the high costs of LLMs are preventing these models from being deployed for the vast majority of applications. In this dissertation, we focus on building efficient and automated systems to reduce costs and democratize access to large language models.

We first introduce systems to optimize computational efficiency and reduce the engineering overhead of distributed LLM training. We develop TeraPipe, which pipelines LLM training along a new dimension (the tokens within a training sequence), and Alpa, the world's first compiler capable of automatically distributing arbitrary neural networks with all existing parallelization methods.
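TeraPipe's new dimension works because a causal language model's activations for earlier tokens do not depend on later ones, so a long training sequence can be cut into slices that flow through pipeline stages in a wavefront. The toy scheduler below only illustrates that wavefront; the real system overlaps these steps across devices.

```python
# Toy wavefront schedule for token-dimension pipelining (TeraPipe's idea):
# stage s can process token slice k at time t = s + k, since causal masking
# means earlier slices never depend on later ones.
def token_pipeline_schedule(n_stages, n_slices):
    """Yield (time_step, stage, slice) triples of the pipelined schedule."""
    for t in range(n_stages + n_slices - 1):
        for s in range(n_stages):
            k = t - s                     # slice index seen by stage s
            if 0 <= k < n_slices:
                yield t, s, k

for t, stage, sl in token_pipeline_schedule(n_stages=3, n_slices=4):
    print(f"t={t}: stage {stage} processes token slice {sl}")
```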

While training is typically a one-time cost, deploying and serving an LLM requires running inference continuously, and this cost is the top blocker for the real-world deployment of LLMs. We improve serving scalability with AlpaServe, which exploits model parallelism, and we increase memory utilization and LLM inference throughput with a new attention algorithm, PagedAttention, and an end-to-end serving system, vLLM.
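The core data structure behind PagedAttention is borrowed from virtual memory: the KV cache is stored in fixed-size physical blocks, and each sequence keeps a table of block indices, so memory is allocated on demand and freed at block granularity. Below is a minimal sketch of that bookkeeping, not vLLM's actual implementation.

```python
# Minimal PagedAttention-style KV-cache bookkeeping sketch (not vLLM's code).
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, n_blocks=1024):
        self.free = list(range(n_blocks))   # pool of physical block ids
        self.table = {}                     # seq_id -> list of block ids
        self.length = {}                    # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token, allocating blocks lazily."""
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:             # current block full (or none yet)
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)
```

Because no sequence pre-reserves a maximum-length region, fragmentation drops and many more sequences fit in memory at once, which is where the throughput gain comes from.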

Overall, these systems provide comprehensive solutions that significantly improve both training and inference efficiency for large language models. Together, these systems lower the high costs associated with large language models, democratizing their deployment across various real-world applications.

Super-Resolution Microscopy and Single-Molecule Diffusivity Mapping: Applications in Cell Biology and Biophysics

(2024)

Fluorescence microscopy has allowed for decades of elegant and compelling biological discovery. However, the diffraction limit of light caps the spatial resolution of fluorescence microscopy at 250-300 nanometers. As the size of an average protein molecule is ~3 nm, diffraction-limited spatial resolution is often more than an order of magnitude larger than what is required to resolve nanoscale cellular structures and processes. With the advent of single-molecule localization microscopy (SMLM) and super-resolution microscopy (SRM), the spatial resolution of fluorescence microscopy has been improved more than 10-fold relative to conventional methods. Further, the fundamental principles of SMLM have engendered multiple other experimental methods that simultaneously probe a second informational domain in addition to precise spatial information. For example, much of the work in this writing utilizes a functional SRM method known as single-molecule diffusivity mapping (SMdM), which is overviewed in Chapter 1.3. We use SMdM to show that, in the mammalian cell, the assembly and disassembly of the vimentin cytoskeleton are highly sensitive to the protein's net charge state. Starting with the intriguing observation that the vimentin cytoskeleton fully disassembles under hypotonic stress yet reassembles within seconds upon osmotic pressure recovery, we pinpoint ionic strength as the underlying driving factor. By further modulating the pH and expressing vimentin constructs with differently charged linkers, we converge on a model in which the vimentin cytoskeleton is destabilized by Coulomb repulsion when its mass-accumulated negative charges are less screened or otherwise intensified. Additionally, we identify a key molecular player, DELE1, in relaying mitochondrial stress to the cytosol and triggering the integrated stress response. Then, using SMdM, we corroborate this finding and show that the intraorganellar diffusivity of both DELE1 and cytochrome c implies the presence of unique electrostatic interactions in the mitochondrial intermembrane space. Together, these studies represent some of the first applications of SMdM to study native cellular proteins rather than exogenous tracer proteins.
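At its core, SMdM infers a local diffusion coefficient from many short single-molecule displacements: in two dimensions, the mean squared displacement over a frame interval Δt gives D ≈ ⟨Δr²⟩ / (4Δt). Below is a minimal estimator sketch on simulated Brownian steps, not the analysis code used in this work.

```python
# Minimal 2D diffusivity estimate from single-molecule displacements,
# in the spirit of SMdM: D ~ <dr^2> / (4 * dt). Illustrative only.
import numpy as np

def estimate_D(displacements, dt):
    """displacements: (n, 2) per-interval (dx, dy) in um; dt in seconds."""
    msd = np.mean(np.sum(displacements**2, axis=1))
    return msd / (4.0 * dt)                # um^2/s

rng = np.random.default_rng(1)
true_D, dt = 10.0, 1e-3                    # um^2/s, s
steps = rng.normal(0.0, np.sqrt(2 * true_D * dt), size=(5000, 2))
print(estimate_D(steps, dt))               # ~10, recovering true_D
```

Binning such displacements spatially is what turns this pointwise estimate into the diffusivity maps used throughout these studies.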

Perceptual Alignment for Human-Centered Design Computing: Quantifying Similarity and Semantic Representations

(2024)

During early-stage design processes, designers must navigate significant uncertainty and make sense of abstract, multi-dimensional goals (e.g., function, aesthetics, ergonomics), eventually synthesizing them into design outcomes. Data-driven design is a paradigm that aims to leverage data and computational methods to support decision making, allowing designers to surpass cognitive limits (e.g., idea fixation). However, concepts fundamental to decision making during early-stage design (e.g., ‘What are similar design ideas?’ and ‘Will the design reflect dependability?’) are ill-defined, cognitively complex, and not well-represented by computation. Therefore, a key challenge is to align computational representations with how humans perceive and process information, enabling designers to accurately express their intent. To address this challenge, my dissertation research explores behavioral studies and computational techniques to understand and quantify representations (both cognitive and reflected within design artifacts) of these complex concepts throughout the design process. First, I demonstrate how function can be quantifiably compared across engineered systems and products, and how human perceptions of similarity align. Then, I show how intangible semantic prompts (e.g., dependable, versatile, comfortable) can be tangibly reflected in designs, by humans and through human-in-the-loop computation. The insights derived from this work contribute to human-centered computing for early-stage design, enabling designers to more easily and effectively design innovative products.

Free resolutions, linkage, and representation theory

(2024)

Across two papers, published in 1989 and 2018, Weyman unearthed a fascinating connection between commutative algebra and representation theory in his study of generic free resolutions of length three. This thesis is devoted to analyzing this connection further. In the first half, we show that certain Kazhdan-Lusztig varieties provide generic examples of ideals in the linkage class of a complete intersection. For those of embedding codimension three, we also compute the free resolutions of their coordinate rings. We later show that these specialize to resolutions of all grade three licci ideals.

In the second half, we develop the machinery of higher structure maps originating from Weyman's generic ring. Using the free resolutions constructed previously, we disprove Hochster's conjecture on finite generation of generic rings. The two perspectives converge in the final chapter of the thesis, in which we develop an ADE correspondence to completely classify grade three perfect ideals with small type and deviation.

Until the Music Stopped: The Second Concert in European Inter-State Relations, 1878-1908

(2024)

The causes of the First World War remain a central preoccupation for international relations scholars. Some find them in the actions of particular aggressors, others in the logic of zero-sum competition between bipolar alliance blocs. Still others describe how an ever more mechanical European state system became increasingly inflexible until it seized up and exploded. I turn this perennial query on its head, asking not why war erupted in 1914, but how Europe’s political class was able to avoid wars during the thirty-three years of pan-Great Power peace stretching from the Berlin Congress (1878) to the Italo-Turkish War (1911).

I argue that the Berlin Congress founded an international regime, which, like its predecessor founded at the Congress of Vienna (1814-15), was framed as a “Concert of Europe,” and was predicated not on a balance of power but on normative principles of international relations, chief among them being the inviolability of member state territorial integrity and sovereignty. Like its Vienna predecessor, this Second Concert recognized subordinate principles, namely minority protections, national self-determination, and human rights. By surveying reportage on the Concert during the regime-challenging crises these subordinate, and state-challenging, principles instigated, I show how the Concert-loyalty of the regime’s member states led to affirmations of the supremacy of the territorial state, thereby preserving both the Concert regime and general peace. This era of tranquility ended with the Bosnian Crisis (1908-09), which saw the first violation of Concert principles by one of its members since the regime’s founding, resulting in the Concert’s dissolution. Europe, plunged back into an international state of nature in which power alone ruled, experienced rapidly escalating violence that culminated in general war.

Scalable and Efficient Systems for Large Deep Learning Models

(2024)

Recent advancements in machine learning have primarily been driven by large-scale deep learning models, particularly large language models. The large scale and new capabilities of these models present challenges in designing infrastructure systems to support their entire lifecycle, from training and serving to evaluation. To meet the high computational and memory requirements of these models, while fully utilizing and accurately evaluating their capabilities, we need to redesign many system components, such as compilers, distributed computing platforms, programming systems, and evaluation methods.

In this dissertation, we introduce a suite of systems designed and built to support large models, covering training, serving, and evaluation phases. First, we discuss Alpa, a system for large-scale model-parallel training, which automatically generates distributed execution plans integrating both inter- and intra-operator parallelism. Moving on to serving, we introduce Ansor, a compiler that produces high-performance implementations of tensor programs for various hardware backends. We also explore SGLang, a system for deploying large language models that includes both a flexible front-end programming interface and an optimized back-end runtime for fast inference. Lastly, in the evaluation phase, we detail our efforts in model evaluation, which include Chatbot Arena, a crowdsourced live benchmark platform, and LLM-as-a-Judge, an automated evaluation pipeline. These tools collectively form a full-stack system for the continuous improvement of large models.
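As an illustration of the LLM-as-a-Judge pattern, the sketch below prompts a strong model to compare two answers; `call_llm` is a stand-in for whatever chat-completion API is used, and the prompt wording is illustrative rather than the paper's exact template.

```python
# Minimal LLM-as-a-Judge sketch: ask a strong model to pick the better
# answer. `call_llm` is a placeholder for any chat-completion API.
JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
answers, decide which answer is better. Reply with exactly "A" or "B".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge(question, answer_a, answer_b, call_llm):
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return verdict.strip()[:1]  # "A" or "B"
```

In practice, each pair is judged twice with the answer order swapped to control for position bias, a failure mode analyzed in the LLM-as-a-Judge work.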