Data-driven models for diagnostic and other clinical prediction tasks have been enabled by the increasing availability of electronic health records (EHRs) and recent developments in machine learning (ML). Notably, the clinical event sequences extracted from EHR data provide important insights into how a patient's illness progresses. However, many of the models developed thus far are trained and validated on data from the same distribution (e.g., a single institutional dataset). When externally validated on distributions other than those used for training, these models exhibit generalizability issues despite their reported in-distribution improvements. This variation between the training and deployment distributions is called dataset shift, and it can be attributed to many factors in the data generation process (e.g., patient demographics, site-specific healthcare delivery patterns, policy changes) and in data processing approaches (e.g., concurrent event ordering, feature mapping). These generalization problems are exemplified by current approaches to modeling EHR data and clinical event sequences.
This dissertation seeks to assess and reduce the impact of dataset shift on the stability of clinical event sequence models, addressing two facets of the problem. First, it explores a method for learning perturbation-invariant representations of event sequences containing concurrent events by modeling them as a sequence of sets, mitigating the dataset shift caused by inconsistent ordering schemes imposed during pre-processing. Using a permutation-sampling-based framework, we enforce perturbation invariance on a clinical dataset with an additional L1 loss. The proposed framework is tested on a next-visit diagnostic prediction task and shows improved robustness to shifts in concurrent event ordering. Second, this research develops a domain-invariant representation learning framework based on unsupervised adversarial domain adaptation, reducing the impact of dataset shift on a model's target-domain performance without requiring any target labels. To improve transfer performance on the unlabeled target domain, the pre-trained Transformer-based framework adversarially learns domain-invariant features that also benefit the discriminative task of next-visit diagnostic prediction. The proposed framework is evaluated in both transfer directions on event sequence datasets from two different healthcare systems and demonstrates superior zero-shot predictive performance on the target data relative to non-adversarial baselines.
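The permutation-sampling idea above can be illustrated with a minimal NumPy sketch, assuming a toy order-sensitive encoder: concurrent events within a visit are randomly permuted, each permutation is encoded, and the L1 distance between the resulting representations serves as the invariance penalty added to the task loss. The names `encode` and `perturbation_loss`, the position-weighting scheme, and all parameters are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(seq, W, pos):
    # Toy order-sensitive encoder: position-weighted sum of event embeddings.
    # Real sequence encoders (e.g., RNNs, Transformers with positional
    # encodings) are order-sensitive in an analogous way.
    return sum(pos[i] * W[e] for i, e in enumerate(seq))

def perturbation_loss(visit, W, pos, n_samples=2):
    # Sample permutations of the concurrent events in one visit and
    # penalise (L1) the difference between the resulting representations.
    perms = [rng.permutation(visit) for _ in range(n_samples)]
    reps = [encode(p, W, pos) for p in perms]
    return float(np.mean([np.abs(reps[0] - r).sum() for r in reps[1:]]))

W = rng.normal(size=(10, 4))      # event embedding table (10 event types)
pos = np.linspace(1.0, 2.0, 5)    # position weights: the source of order sensitivity
visit = np.array([3, 7, 1])       # concurrent events in one visit (an unordered set)
print(perturbation_loss(visit, W, pos))  # typically nonzero for an order-sensitive encoder
```

In training, this penalty would be weighted and added to the diagnostic prediction loss, pushing the encoder toward representations that do not depend on the arbitrary ordering imposed on concurrent events.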
This dissertation advances our understanding of how dataset shift affects the generalization and stability of clinical event sequence diagnostic prediction models, and offers solutions to reduce its impact in both single-source perturbation and cross-dataset unsupervised transfer learning settings.