This thesis is organized in a slightly unconventional fashion: algorithms lead and applications
fill out the content. I think this emphasizes my interests during graduate school:
I built algorithms and tools to address issues that were otherwise inaccessible to different
areas of computational chemistry (including applied machine learning) and enzymology. Two
scientific thrusts underpin the bulk of my work: algorithms to analyze dynamic,
heterogeneous electric fields in the context of enzymology, and flexible machine learning
algorithms, including those that leverage quantum descriptors, for rigorous prediction of
molecular and reaction-level properties. Each section also grounds the work in applications
and broader impacts for the reader. Below, I discuss the main thrusts and briefly outline
each chapter.
General ML and Quantum Theory of Atoms-in-Molecules (QTAIM): QTAIM serves as a
mathematical decomposition algorithm for electronic basins within a molecule.
The algorithm takes molecular electron densities as input, typically computed by density
functional theory (DFT), and uses the flux of the density gradient to partition the scalar
field into three-dimensional regions [14, 16]. These regions are known as atomic basins and represent
the quantum atom within a molecule. By constructing these structures, we compute a rich
set of mathematical descriptors that map to many features including energies, bonding,
and electron delocalization. These features have been correlated, in the past, to activation
energies, reactivity, and overall system energies, but these uses largely relied on human
intervention and small datasets [44, 62, 65, 111, 142, 287]. By developing software centered
around high-throughput QTAIM calculations and machine learning, I was able to bring these
descriptors to larger datasets and a wide host of applications.
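To make the partitioning idea concrete, the sketch below assigns points to basins by following the density gradient uphill until it reaches a maximum. It is a toy illustration only: the two-Gaussian density, the names `CENTERS`, `WEIGHTS`, `density`, `gradient`, and `assign_basin`, and the fixed-step ascent are all assumptions made here, not the algorithm used by any QTAIM program or by the software described in this thesis.

```python
import numpy as np

# Hypothetical toy "nuclei" in 2-D; positions and weights are illustrative only.
CENTERS = np.array([[-1.0, 0.0], [1.0, 0.0]])
WEIGHTS = np.array([1.0, 1.0])


def density(r):
    """Toy density: a sum of isotropic Gaussians centered on the nuclei."""
    d2 = np.sum((np.asarray(r, dtype=float) - CENTERS) ** 2, axis=1)
    return float(np.sum(WEIGHTS * np.exp(-d2)))


def gradient(r):
    """Analytic gradient of the toy density at point r."""
    diff = np.asarray(r, dtype=float) - CENTERS      # shape (n_nuclei, 2)
    d2 = np.sum(diff ** 2, axis=1)
    return np.sum((-2.0 * WEIGHTS * np.exp(-d2))[:, None] * diff, axis=0)


def assign_basin(r, step=0.05, max_iter=500, tol=1e-12):
    """Take fixed-length steps along the gradient, then return the index of
    the nucleus closest to the end point (the attractor of that path)."""
    r = np.array(r, dtype=float)
    for _ in range(max_iter):
        g = gradient(r)
        norm = np.linalg.norm(g)
        if norm < tol:          # exactly on a critical point; stop moving
            break
        r = r + step * g / norm
    return int(np.argmin(np.sum((r - CENTERS) ** 2, axis=1)))


# Points on either side of the dividing surface (x = 0 for equal weights)
# end up in different basins.
left = assign_basin([-0.5, 0.3])
right = assign_basin([0.5, -0.3])
```

In this picture, the set of all points whose uphill paths end at the same maximum forms one basin, and the boundary between basins is a surface that no gradient path crosses.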
In Chapter 2, I discuss an algorithm I implemented to predict Diels-Alder reaction
barriers from QTAIM signatures alone. In this study, we showed that QTAIM features can be
used to predict reaction barriers, and we used machine learning techniques to understand
which signatures were most informative to our models. Here, QTAIM electrostatic potentials
and delocalization indices alone yielded strong performance on withheld datasets.
In addition, we demonstrated that QTAIM features can allow a machine learning model to
generalize, to an extent, to much larger Diels-Alder reactions. This chapter was adapted from
the following: Machine Learning to Predict Diels–Alder Reaction Barriers from the Reactant
State Electron Density. S. Vargas*, M. Hannefarth, Z. Liu, A.N. Alexandrova. Journal of
Chemical Theory and Computation 2021 17 (10), 6203-6213. DOI: 10.1021/acs.jctc.1c00623.
In Chapter 3, I discuss a package developed to perform high-throughput QTAIM
calculations on datasets of molecules and reactions. This package is currently adapted to
work with freely available programs such as ORCA and Multiwfn. These programs, respectively,
compute DFT densities at a user-specified level of theory and subsequently compute QTAIM
descriptors. The package is built with high-performance computing (HPC) in mind, as it
can operate on a single dataset with an arbitrary number of concurrent jobs. Here I also
used the package to compute QTAIM values for a diverse set of important and difficult
datasets and developed graph neural networks to predict molecular and reaction properties
leveraging QTAIM as inputs. This chapter was adapted from the following: High-throughput
quantum theory of atoms in molecules (QTAIM) for geometric deep learning of molecular and
reaction properties. Santiago Vargas, Winston Gee, and Anastassia N. Alexandrova. Digital
Discovery 2024 3, 987-998.
Advancing Analysis of Electric Fields in Proteins: The later chapters follow our
work in developing algorithms to ingest, interpret, and predict on electric fields in protein
active sites. This work builds on the notion of electrostatic preorganization, a theory that
posits that protein scaffolds are arranged to electrostatically catalyze chemical reactions,
thereby destabilizing reactants while lowering transition state energies [299, 301].
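As a point of reference for what an electric field in an active site means computationally, the field at a set of points can be approximated from fixed atomic point charges with Coulomb's law. The sketch below is illustrative only: the function name `electric_field`, the use of atomic units (Coulomb constant set to 1), and the toy charges are assumptions made here, not the workflow used in the later chapters.

```python
import numpy as np


def electric_field(points, charge_positions, charges):
    """Coulomb field at each point from fixed point charges (atomic units, k = 1).

    points:           (n_points, 3) positions where the field is evaluated
    charge_positions: (n_charges, 3) positions of the point charges
    charges:          (n_charges,) signed charges
    Returns an (n_points, 3) array of field vectors.
    """
    points = np.asarray(points, dtype=float)
    charge_positions = np.asarray(charge_positions, dtype=float)
    charges = np.asarray(charges, dtype=float)
    # Displacement from every charge to every point: (n_points, n_charges, 3)
    disp = points[:, None, :] - charge_positions[None, :, :]
    dist = np.linalg.norm(disp, axis=2)
    # E(r) = sum_i q_i (r - r_i) / |r - r_i|^3
    return np.sum(charges[None, :, None] * disp / dist[:, :, None] ** 3, axis=1)


# Toy example: field at the origin between a +1 and a -1 charge on the z-axis.
field_at_origin = electric_field(
    points=[[0.0, 0.0, 0.0]],
    charge_positions=[[0.0, 0.0, -1.0], [0.0, 0.0, 1.0]],
    charges=[1.0, -1.0],
)
```

Evaluating this function on a grid of points spanning the active site, rather than at a single point, gives the kind of three-dimensional, heterogeneous field that the following chapters analyze.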
Chapter 4 describes our efforts to apply heterogeneous electric field analysis to
understand a protoglobin directed evolution (DE)
trajectory. Previous DE efforts optimized protoglobin to efficiently catalyze carbene transfer
reactions. We show that traditional explanations for increased catalytic activity across the
DE lineage, substrate access and binding, cannot account for the dramatic improvements in
protein activity. By tracking the 3-D electric field and using clustering algorithms, we pinpoint
representative structures for QM/MM calculations and show that changes in the electric field,
along DE, improve carbene transfer reactivity. These findings highlight the role that
electrostatic organization, notably its dynamic effect, plays in determining protein function
and point to its future importance in designing proteins for relevant chemical processes. This chapter is
adapted from Directed Evolution of Protoglobin Optimizes the Enzyme Electric Field. Shobhit
S. Chaturvedi, Santiago Vargas, Pujan Ajmera, and Anastassia N. Alexandrova. Journal of
the American Chemical Society 2024 146 (24), 16670-16680 DOI: 10.1021/jacs.4c03914.
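The idea of clustering electric fields to pick representative structures can be sketched as follows. This is a minimal illustration under stated assumptions: each frame's field is flattened into one vector, a plain k-means with deterministic farthest-point initialization stands in for whatever clustering algorithm and distance metric the chapter actually uses, and the names `kmeans` and `representative_frames` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance from each sample to its nearest chosen center
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def representative_frames(fields, k):
    """Cluster per-frame fields and return the frame closest to each centroid.

    fields: (n_frames, n_points, 3) array of field vectors per frame.
    Returns (labels, reps), where reps holds one frame index per non-empty cluster.
    """
    X = np.asarray(fields, dtype=float).reshape(len(fields), -1)
    labels, centers = kmeans(X, k)
    reps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        d = np.sum((X[members] - centers[j]) ** 2, axis=1)
        reps.append(int(members[np.argmin(d)]))
    return labels, reps
```

The frames returned in `reps` play the role of the representative structures that would then be passed on to more expensive calculations such as QM/MM.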
In Chapter 5, I introduce a machine learning framework designed to predict enzyme
functionality directly from the heterogeneous electric fields applied to protein active sites. We
apply this method to a dataset of heme-iron oxidoreductases. Previous approaches, which focused
on a simple point electric field along the Fe-O bond, are insufficient for reasonable accuracy.
In contrast, our 3-D, heterogeneous model can accurately predict protein activity
without relying on additional protein-specific information. In addition, feature selection
elucidates which electric field components most inform our models and thus highlights the
components most important to reactivity and selectivity. Finally, we apply the previously
mentioned electric field clustering algorithms and QM/MM calculations to reveal how dynamic
complexities in protein structures can complicate predictions, which provides a path forward
for improved models in this space. This chapter is adapted from Machine-learning prediction of protein
function from the portrait of its intramolecular electric field. S. Vargas*, S. Chaturvedi, A.N.
Alexandrova. (Accepted, Journal of the American Chemical Society)