Theoretical studies of molecules have historically relied on deterministic algorithms, stochastic simulations, and physical models. Recently, modern data-driven methods have begun to enter many fields of molecular science, opening new possibilities for solving problems that are difficult to tackle through traditional approaches. The accumulation of data, advances in machine learning algorithms, and improvements in hardware have enabled a plethora of data-driven approaches to surpass traditional methods in accuracy and efficiency, but questions remain about how well these methods generalize to unseen data and make genuine predictions. In this dissertation, I show that when data-driven models are combined with physics-based approaches, through either feature design or constraints imposed on the machine learning models, new standards can be established in the fields of molecular characterization and generation.
Nuclear magnetic resonance (NMR) chemical shifts (CSs) are extremely sensitive to the local atomic environments of the different nuclei in a molecule, and NMR is therefore a common technique for molecular characterization. In Chapter 2, I focus on the design of the UCBShift predictor of CSs for proteins in aqueous solution. The UCBShift method uniquely fuses a transfer prediction module, which employs sequence and structure alignments to select reference chemical shifts from a database, with a machine learning model that uses carefully curated, physics-inspired features. Together they predict CSs for proteins with higher accuracy and better reliability than all popular methods, such as SHIFTX2 and SPARTA+. This chapter further shows how UCBShift benefits from realistic data that has not been heavily curated, and how it surpasses existing CS calculators in real-world performance without eliminating test predictions ad hoc.
However, to achieve rigorous and consistent improvement for an arbitrary molecular system, feature sets curated specifically for proteins can be limiting, so we seek features from theoretical calculations instead. In Chapter 3, I describe the development of a novel neural network model that employs quantum mechanical (QM) features from affordable density functional theory (DFT) calculations, along with geometric features of the molecular systems, to predict NMR chemical shieldings. The resulting iShiftML model predicts chemical shieldings with an accuracy approaching that of CCSD(T) in the complete basis set limit, the highest level of accuracy in the modern theoretical framework, but without the computational burden that limits the applicability of CCSD(T) to large systems. Not only does the iShiftML model demonstrate excellent predictive performance when compared with gas-phase experimental CSs of small molecules, but it can also predict chemical shifts for much more complex natural products and can be used to differentiate diastereomers based on chemical shift assignments. This chapter unveils new possibilities for integrating machine learning and QM calculations for accurate and transferable molecular characterization.
In Chapters 4 and 5, my research addresses fundamental issues in large and small molecule generation relevant to proteins and drug molecules. Chapter 4 describes the Int2Cart method, which uses a recurrent neural network to predict how bond lengths and bond angles correlate with the backbone torsion angles and amino acid sequence of a protein. By incorporating these correlations, proteins reconstructed from torsion angles alone not only display physically more accurate bond lengths and bond angles, but are also closer to their crystal structures than reconstructions made under the common assumption that bond lengths and bond angles are fixed, or with values taken from a static library that relies only on local residue geometries. I also show potential applications of this method in estimating model quality for AlphaFold2-predicted structures and in reconstructing intrinsically disordered protein (IDP) ensembles with decreased steric overlap. Chapter 5 describes the combination of deep generative networks trained by reinforcement learning with physical docking studies. I developed the iMiner method, which generates de novo drug-like molecules with enhanced binding potency towards specific protein targets, facilitating the discovery of potential new therapeutic targets. The SARS-CoV-2 main protease was used as an example to show that our generated molecules cover a broader chemical space than crowdsourcing efforts, and that the newly generated molecules exhibit optimized interactions and a shape compatible with the binding pocket.
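The reconstruction step summarized for Chapter 4, going from internal coordinates (bond lengths, bond angles, torsion angles) to Cartesian positions, can be illustrated with a minimal sketch. It uses the standard NeRF-style placement of one atom from the three preceding atoms; this is an assumption made for illustration, not the Int2Cart implementation, and the function names and geometry values below are hypothetical. In a fixed-geometry reconstruction the `bonds` and `angles` arrays would hold constants, whereas a model such as Int2Cart would supply residue-specific values.

```python
import numpy as np


def place_atom(a, b, c, bond, angle, torsion):
    """Return d with |d - c| = bond, angle(b, c, d) = angle and
    dihedral(a, b, c, d) = torsion. Angles are in radians."""
    bc = (c - b) / np.linalg.norm(c - b)      # unit vector along the b->c bond
    n = np.cross(b - a, bc)                   # normal to the a-b-c plane
    n = n / np.linalg.norm(n)
    m = np.cross(n, bc)                       # completes the local frame
    local = bond * np.array([
        -np.cos(angle),
        np.sin(angle) * np.cos(torsion),
        np.sin(angle) * np.sin(torsion),
    ])
    return c + local[0] * bc + local[1] * m + local[2] * n


def build_chain(seed, bonds, angles, torsions):
    """Extend three seed atoms by one atom per (bond, angle, torsion) triple."""
    coords = [np.asarray(p, dtype=float) for p in seed]
    for r, theta, phi in zip(bonds, angles, torsions):
        coords.append(place_atom(coords[-3], coords[-2], coords[-1], r, theta, phi))
    return np.array(coords)


def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians (IUPAC sign convention)."""
    b1, b2, b3 = b - a, c - b, d - c
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    return np.arctan2(np.linalg.norm(b2) * np.dot(b1, n2), np.dot(n1, n2))


# Hypothetical, illustrative geometry (angstroms and degrees); these are
# NOT Int2Cart predictions or values taken from the dissertation.
seed = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
bonds = [1.33, 1.46, 1.52]
angles = np.radians([116.0, 122.0, 111.0])
torsions = np.radians([180.0, -60.0, -45.0])

chain = build_chain(seed, bonds, angles, torsions)
```

Because each atom is placed relative to the previous three, small errors in assumed bond lengths and angles accumulate along the chain, which is one reason residue-specific values can bring a torsion-only reconstruction closer to the reference structure.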
To summarize, this dissertation presents multiple methods that harmonize data-driven models and physics-based approaches in the areas of NMR spectroscopy, protein structure modelling, and de novo drug discovery, providing a new perspective for researchers striving to leverage computational methods in molecular science and chemical biology.