In recent years, molecular science has been transformed by the integration of data-driven methodologies alongside traditional deterministic and stochastic approaches. Historically, the study of molecular behavior and interactions relied heavily on deterministic algorithms, which follow a fixed sequence of computational steps to simulate molecular dynamics, and stochastic simulations, which incorporate randomness to explore molecular states and pathways. These methods were complemented by physical models grounded in the established principles of chemistry and physics, forming the backbone of theoretical molecular science. However, these conventional approaches are often limited in scalability, computational cost, and generalizability to complex systems. Improvements in computational hardware, coupled with the accumulation of vast amounts of molecular data, have enabled the development of models that combine physics-based and machine learning (ML) approaches and can surpass traditional methods in both accuracy and efficiency. This dissertation develops new models that utilize more accessible data, provides guidelines for computational data generation, and explores the synergy between data acquisition strategies and data-driven models. Through a series of studies, it demonstrates that carefully designing data acquisition strategies and integrating data-driven models with physics-based approaches can enhance the predictive capabilities of computational methods in chemistry, particularly in force field development and electrostatic modeling.
The accurate prediction of electrostatic interactions is critical to understanding molecular behavior. The electrostatic potential (ESP) is of great research interest because it characterizes the charge distributions that drive interactions between molecules. However, traditional approaches often rely on detailed quantum mechanical calculations, which can be computationally expensive. In Chapter 2, I introduce a coarse-grained electron model (C-GeM) that balances accuracy and computational efficiency, with parameters fitted to high-quality density functional theory (DFT) reference data. Extensive validation against high-level quantum mechanical calculations demonstrates that C-GeM can reliably predict electrostatic potentials and interaction energies in proteins. The model's implementation in large-scale molecular simulations shows significant reductions in computational cost, making it a viable tool for studying complex biological systems.
The generation of reference data for deep learning models poses significant challenges for reactive systems, especially for combustion reactions, where extreme conditions produce radical species and alternative spin states. In Chapter 3, intrinsic reaction coordinate (IRC) calculations are extended with \textit{ab initio} molecular dynamics (MD) simulations and normal mode displacement calculations to comprehensively map the potential energy surface (PES) for 19 reaction channels involved in hydrogen combustion. This extensive dataset comprises approximately 290,000 potential energies and 1,260,000 nuclear force vectors, evaluated using a high-quality range-separated hybrid density functional, $\omega$B97X-V. The dataset includes detailed information on transition state configurations as well as geometries along the reactive pathway that links reactants to products, providing a robust reference for training deep learning models aimed at studying hydrogen combustion reactions. This benchmark dataset not only serves as a valuable resource for understanding the intricate mechanistic pathways of hydrogen combustion but also provides a paradigm for building datasets that facilitate the development and validation of machine learning models for reactive chemistry.
Building on the benchmark dataset for hydrogen combustion detailed in Chapter 3, an initial machine learning model is trained to predict energies and forces for the hydrogen combustion reactive system using NewtonNet, a physics-inspired equivariant message passing neural network (MPNN). This reactive gas-phase chemistry network is particularly challenging because it requires comprehensive potential energy surfaces that accurately represent a wide range of molecular configurations. Traditional approaches often rely on chemical intuition to select training data, which can result in incomplete PESs in an ML setting. To address this challenge, I employ an active learning strategy that systematically explores diverse energy landscapes using metadynamics simulations and continuously adds unseen data for retraining, helping to create an ML model that avoids unforeseen high-energy or unphysical configurations. By integrating metadynamics, the active learning process converges the PES more rapidly. It also enables a hybrid of ML and \textit{ab initio} MD in which rare calls to external \textit{ab initio} sources are initiated when the query-by-committee models detect discrepancies. This hybrid ML-physics approach reduces computational costs by two orders of magnitude and eliminates the need for excessive ML retraining. The enhanced model accurately predicts free energy changes and transition state mechanisms for several hydrogen combustion reaction channels, demonstrating the efficacy of combining advanced data acquisition strategies with robust ML techniques to achieve high precision and efficiency in molecular simulations.
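The query-by-committee check that triggers a new reference calculation can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `select_for_labeling`, the use of the standard deviation of predicted energies as the disagreement measure, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def select_for_labeling(committee_energies, threshold):
    """Flag configurations on which a committee of models disagrees.

    committee_energies: array-like of shape (n_models, n_configs), the
        energy each committee member predicts for each configuration.
    threshold: hypothetical cutoff on the committee standard deviation,
        in the same units as the energies.

    Returns the indices of flagged configurations and the per-configuration
    disagreement.
    """
    energies = np.asarray(committee_energies, dtype=float)
    # Disagreement is taken here as the spread of predictions across models.
    disagreement = energies.std(axis=0)
    flagged = np.flatnonzero(disagreement > threshold)
    return flagged, disagreement


# Toy data: three models, three configurations (values are made up).
preds = np.array([
    [0.10, 1.00, 2.00],
    [0.11, 1.50, 2.01],
    [0.09, 0.40, 1.99],
])
flagged, sigma = select_for_labeling(preds, threshold=0.1)
```

In this toy example only the second configuration, where the three predictions differ widely, would be sent for an \textit{ab initio} calculation and added to the training set. The same test could be applied to predicted forces instead of energies.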
To summarize, this dissertation underscores the potential of combining data-driven models with physics-based approaches to overcome the limitations of traditional computational methods in molecular science. Through the development of the coarse-grained electron model (C-GeM), the creation of a comprehensive benchmark dataset for hydrogen combustion, and the implementation of an active learning workflow for reactive force field development, this work provides insights into developing new computational tools and leveraging them to better understand molecular interactions and reactivity.