The rapid digitization of healthcare has led to a proliferation of clinical data, manifesting through electronic health records, biorepositories, and disease registries. This dissertation addresses the question of how machine learning (ML) techniques can capitalize on these data resources to assist clinicians in predicting, preventing and treating illness. To this end, we develop a set of ML-based, data-driven models of patient outcomes that we envision being embedded within decision support systems deployed at different stages of patient care.
We focus on two broad setups for analyzing clinical data: (1) the cross-sectional setup, wherein data is collected by observing many patients at a particular point in time, and (2) the longitudinal setup, in which repeated observations of the same patient are collected over time. In both setups, we develop models that are: (a) capable of answering counter-factual questions, i.e., they can predict outcomes under alternative treatment scenarios, (b) interpretable in the sense that clinicians can understand how the model's predictions for individual patients are issued, and (c) automated in the sense that they adaptively tune their modeling choices for the dataset at hand, with little or no need for expert intervention. Models satisfying these three requirements would enable the realization of actionable, transparent and automated decision support systems that operate symbiotically within existing clinical workflows.
Our technical contributions are multi-faceted. In the cross-sectional data setup, we develop ML models that fulfill the aforementioned requirements (a)-(c) as follows. We start by developing a comprehensive theoretical framework for causal inference, whereby we quantify the limits to how well ML models can recover the causal effects of counter-factual treatment decisions on individual patients using observational (retrospective) data, and we build ML models, based on Gaussian processes, that achieve these limits. Next, we develop a novel symbolic meta-modeling approach for interpreting the predictions of any ML-based prognostic model by converting the “black-box” model into an understandable symbolic equation that relates patients’ features to their predicted outcomes. Finally, we develop a model selection approach based on Bayesian optimization that enables the automation of predictive and causal modeling. In the longitudinal data setup, we develop a novel deep probabilistic model for sequential clinical data that satisfies requirements (a)-(c) by capitalizing on the strengths of both state-space models and deep recurrent neural networks.
To demonstrate the utility of our models, we evaluate their performance on various real-world datasets for cohorts of breast cancer, cardiovascular disease and cystic fibrosis patients. We show that, compared to existing clinical scoring systems, our ML-based models can improve the accuracy of predicting individual-level prognoses, guide treatment decisions for individual patients, and provide insights into underlying disease mechanisms.