Understanding how exposures from our environment, diet, and lifestyle interact with unique genetic, physiologic, and epigenetic profiles to impact health is a central objective of environmental health sciences. However, causal inference is a formidable challenge in many environmental health contexts, such as those involving mixed exposures, multiple mediating pathways, heterogeneity in exposure effects, and interactions in high-dimensional data. Existing causal inference methodologies in these settings rely on simplifying assumptions that do not represent complex real-world patterns. Commonly used statistical methods based on general linear models (GLMs) fail to untangle the exposures truly affecting health because of multicollinearity, high-dimensional interactions, and complex joint distributions. Today's scientific endeavors in environmental health require the adoption of new non-parametric methods that use flexible machine learning for causal inference on mixed exposures, mixture mediation, and heterogeneity of exposure effects. Causal inference in these arenas can answer critical questions such as: What mixture of metal exposures during pregnancy influences maternal and child health? At what levels are these impacts most severe? How do oxidative stress and inflammatory biomarkers mediate the mechanisms of action of these exposure interactions? How do we both identify the parts of a mixture that are important and estimate the expected outcome if those parts of the mixture changed? Are there certain subpopulations that are more susceptible to changes in parts of a mixture?
Unlike settings such as medicine, where the treatment or exposure is known a priori and propensity score-based methodologies can be deployed with relative ease, mixtures pose two difficulties: many exposures are measured on a continuous scale (where propensity score methodologies break down), and there are many of them. We do not observe the expected outcome under every combination of multiple continuous exposures. Even if we did, some interpretable representation of this gradient would still be necessary. As such, the subspaces of the exposure space, or the specific variable subsets of the mixture, that affect the outcome must be identified from the data because they are not known a priori. We must use the data both to identify these mixture regions and to derive estimates given exposure to them. This requires data-adaptive target parameters: the mixture is mapped into a lower-dimensional exposure in one part of the data, and a target parameter given this exposure is estimated in another part of the data. Data-adaptive target parameters therefore provide a unifying framework for causal inference in mixture problems. Each non/semi-parametric method presented leverages data-adaptive target parameters to first find the areas of the mixture space that are most impactful and then estimate a target parameter given that space. This dissertation is divided into five chapters built around estimating the causal effects of a mixture under the larger theory of data-adaptive target parameters, which also extends to decomposing effects into mediating pathways and to estimating heterogeneity and interaction.
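The sample-splitting idea behind a data-adaptive target parameter can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated data, the function names `find_region` and `estimate_in_region`, and the use of a single threshold split in place of the ensemble-selected estimators described in this dissertation. It is not the implementation used in the chapters that follow, and it ignores confounding entirely.

```python
import numpy as np

def split_sse(y, inside):
    """Squared error left after fitting one mean on each side of a split."""
    return ((y[inside] - y[inside].mean()) ** 2).sum() + \
           ((y[~inside] - y[~inside].mean()) ** 2).sum()

def find_region(A, y, probs=np.linspace(0.1, 0.9, 17)):
    """Hypothetical helper: search every exposure and candidate threshold
    for the single split that best explains the outcome."""
    best = None
    for j in range(A.shape[1]):
        for t in np.quantile(A[:, j], probs):
            sse = split_sse(y, A[:, j] > t)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best[1], best[2]

def estimate_in_region(A, y, j, t):
    """Hypothetical helper: unadjusted mean difference and standard error
    for the region {A_j > t}, computed on data not used to pick (j, t)."""
    inside = A[:, j] > t
    diff = y[inside].mean() - y[~inside].mean()
    se = np.sqrt(y[inside].var(ddof=1) / inside.sum()
                 + y[~inside].var(ddof=1) / (~inside).sum())
    return diff, se

# Simulated mixture: three exposures, only the first matters, above 0.6.
rng = np.random.default_rng(0)
n = 2000
A = rng.uniform(0, 1, size=(n, 3))
y = 2.0 * (A[:, 0] > 0.6) + rng.normal(0, 1, n)

# Part 1 of the data defines the parameter; part 2 estimates it.
train = np.arange(n) < n // 2
j_hat, t_hat = find_region(A[train], y[train])
effect, se = estimate_in_region(A[~train], y[~train], j_hat, t_hat)
print(f"exposure {j_hat}, threshold {t_hat:.2f}, effect {effect:.2f} (SE {se:.2f})")
```

Because the region is chosen on one half of the data and the effect is estimated on the other, the standard error in the second step is not distorted by the search over variables and thresholds.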
Statistical advancement in estimating the effects of mixtures is key to furthering the progress of environmental health science in understanding the health impacts of environmental exposures. Current statistical methodology lacks the ability to realistically capture the complexity of mixed exposures in an interpretable and informative summary measure. The question then becomes: what is the mapping of multiple continuous, multinomial, and/or binary exposures into an interpretable summary measure, and what do we want to estimate given that summary measure? The different statistical aims provided represent different ways of answering this question. Each can be thought of as a statistical machine where the analyst simply inputs the data for exposures, covariates, and an outcome, and the rest is automatic. From the data, the impactful areas of the mixture are identified using the best-fitting model chosen from an ensemble, and a target parameter is estimated with proper estimates of variance for this mixture subregion. In this way, results are data-driven rather than dependent on human modeling choices, which can introduce bias.
Chapter 1 considers the problem of both identifying exposure variables and thresholds of these variables in a mixture and estimating the expected outcome if all individuals were exposed to this exposure combination compared to if none were. To meet this challenge, the best-fitting decision tree from an ensemble is treated as a data-adaptive parameter. Using the subregions of the mixture delineated by the tree that best explains an outcome, we then develop an estimator that compares the expected outcome if all individuals were exposed to this region with the expected outcome if none were, while flexibly adjusting for covariates. We apply this novel approach to the NIEHS synthetic mixtures data, which allows us to compare interactions identified and estimated in the mixture to ground-truth interactions built into the data-generating system. Furthermore, we apply our method to National Health and Nutrition Examination Survey (NHANES) data to understand which metal mixtures, if any, contribute to shorter leukocyte telomere length. Telomeres are sensitive to various environmental factors, including exposure to metals and metal mixtures. Several studies have explored the relationship between metal exposures and telomere length, particularly in occupational and environmental settings.
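The contrast described for Chapter 1, the expected outcome if everyone were exposed to a mixture region versus if no one were, can be sketched with a simple plug-in (g-computation) estimate. The sketch below is an illustration under loud assumptions: the region is fixed in advance rather than found by a tree, the data are simulated, and a linear outcome regression stands in for the flexible ensemble and targeted maximum likelihood estimation used in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated data: covariate W confounds the exposure A1 and the outcome.
W = rng.normal(size=n)
A1 = 0.8 * W + rng.normal(size=n)
A2 = rng.normal(size=n)

# Assumed (not learned) mixture region: both exposures above zero.
region = ((A1 > 0.0) & (A2 > 0.0)).astype(float)
y = 1.5 * region + 2.0 * W + rng.normal(size=n)  # true regional effect is 1.5

def design(r, w):
    """Design matrix for the outcome regression: intercept, region
    indicator, covariate, and their product."""
    return np.column_stack([np.ones_like(w), r, w, r * w])

# Fit the outcome regression, then predict for every individual under
# "all exposed to the region" and under "none exposed to the region".
beta, *_ = np.linalg.lstsq(design(region, W), y, rcond=None)
y_all_exposed = design(np.ones(n), W) @ beta
y_none_exposed = design(np.zeros(n), W) @ beta

adjusted = (y_all_exposed - y_none_exposed).mean()
naive = y[region == 1].mean() - y[region == 0].mean()
print(f"naive difference: {naive:.2f}, covariate-adjusted: {adjusted:.2f}")
```

In this simulation the unadjusted difference overstates the effect because individuals in the region also tend to have higher values of W; averaging model predictions over the whole sample removes that imbalance when the outcome model is adequate.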
With both synthetic and real-world data, we compare our findings to other commonly used mixture methods. Our goal is to show that when using other methods to test for interactions, the combinatorial problem explodes, reducing power. An analyst interested in testing the effects of different possible interactions in the mixture must first decide: What degree of interaction? Which variables are included in the interaction? Does the definition of the interaction even make sense? It quickly becomes clear that comparing results to our approach is difficult because other methods require user choice of model parameters, which may induce bias. Our approach automatically identifies the correct interactions built into the data-generating process, whereas methods like quantile g-computation require users to specify the interactions, which are not known in advance. These interactions are therefore missed and the resulting estimates are incorrect.
Chapter 2 examines new semi-parametric definitions for interaction and effect modification that exist outside the scope of linear modeling. Consider an analyst interested in assessing interactions using a GLM: what do the beta coefficients on the interaction term mean if the model itself is inherently misspecified? We need a definition of interaction and effect modification that can be estimated from a large class of non/semi-parametric functions that best estimate nonlinearities in a mixture. Even with these definitions, we need a method that both identifies the variable sets used in the best-fitting estimator, selected from a large class of flexible functions, and then applies these interaction and effect modification target parameters to these variable sets. Here we again rely on the general framework of data-adaptive target parameters. We expand work done on stochastic interventions to create definitions for interaction and effect modification, and we use the same sample-splitting techniques used in Chapter 1 to identify variable sets in one part of the data and apply our target parameters in another. Again, we apply this method to NHANES data to investigate the interactions of persistent organic pollutants (POPs) on leukocyte telomere length. We focus on POPs because this dataset is publicly available and has been used in mixtures workshops, which allows us to compare our findings to those published on this dataset. Although this example and the NIEHS synthetic data focus on interaction, Chapter 2 also investigates heterogeneity of treatment/exposure effects. For instance, our causal target parameter in Chapter 1 is the average regional exposure effect, or the average difference in outcomes if all individuals were exposed to a subspace of the mixture compared to if no individuals were exposed to this subspace.
Likewise, in Chapter 2, for the marginal case, we are interested in the expected disease outcome if, say, exposure to certain metals decreased by 1 nanogram; we then compare this expected outcome to the outcome under observed (not decreased) metal levels. In both situations we are averaging across our sample, but what if there are subpopulations in which the impacts are much greater? After a target parameter, which approaches the truth at a certain rate, is determined, how do we find regions in the covariate-exposure space where these impacts vary the most? Here, we are interested in identifying populations that are vulnerable to a mixed exposure. Chapter 2 also describes a novel approach for finding types of people who are differentially impacted by chemical exposures.
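The shift comparison described above (the expected outcome if an exposure were lowered by a fixed amount versus the outcome at observed levels) can be sketched as a plug-in estimate. The simulated data, the shift size `delta`, and the quadratic outcome regression are assumptions for illustration only; the estimators in this dissertation use flexible learners and targeted learning rather than this simple substitution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Simulated data: one continuous exposure A that depends on a covariate W,
# with a nonlinear effect of A on the outcome.
W = rng.normal(size=n)
A = 5.0 + W + rng.normal(size=n)
y = 0.3 * A**2 + W + rng.normal(size=n)

def design(a, w):
    """Outcome regression basis: intercept, exposure, exposure squared, covariate."""
    return np.column_stack([np.ones_like(a), a, a**2, w])

beta, *_ = np.linalg.lstsq(design(A, W), y, rcond=None)

delta = 1.0  # assumed shift: lower every individual's exposure by one unit
mean_shifted = (design(A - delta, W) @ beta).mean()   # outcome under the shift
mean_observed = (design(A, W) @ beta).mean()          # outcome at observed levels
shift_effect = mean_shifted - mean_observed

# In this simulation the truth is 0.3 * (1 - 2 * E[A]) = 0.3 * (1 - 10) = -2.7.
print(f"estimated effect of a {delta}-unit decrease: {shift_effect:.2f}")
```

Computing the same difference within covariate-defined subgroups, instead of over the full sample, is one simple way to look for the heterogeneity this chapter is concerned with.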
Chapter 3 extends the data-adaptive framework for variable set identification and stochastic interventions, developed in Chapter 2 for discovering and estimating interaction and effect modification, to mediation. It describes how, using the same framework, mediating pathways can be discovered and estimated. Mediation analysis in causal inference has traditionally focused on one binary exposure using deterministic interventions, decomposing the average treatment effect into direct and indirect effects through one mediator. As discussed, in more realistic exposure settings, individuals are exposed to multiple continuous-valued exposures that affect health outcomes through different mediating pathways. The exposures that impact health outcomes, and the pathways that may mediate them, are unknown a priori in most instances. Even if the analyst wants to test an exposure-mediator pathway based on domain knowledge, this may not be the strongest pathway in the underlying data. To address this, we propose a methodological framework that both identifies exposure-mediation pathways and delivers unbiased estimates of direct (not through a mediator) and indirect (through a mediator) effects given intervention on exposure subsets. Our approach follows the same framework described in Chapter 2 but estimates direct and indirect effects in the presence of high-dimensional continuous, binary, and categorical exposures and mediators. To uncover the exposure-mediation pathways, we propose a cross-validation procedure. In the path-identification portion of the data, sequential semi-parametric regressions are applied to find pathways: one for the mediators given exposures and covariates, and another for the outcome given exposures, mediators, and covariates. In the estimation portion of the data, we apply stochastic interventions to exposures with targeted learning to create efficient estimators based on flexible regression techniques.
Our efficient estimator is asymptotically linear under a condition requiring n^(1/4)-consistency of certain regression functions.
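The path-identification step described for Chapter 3 (mediators regressed on exposures and covariates, then the outcome regressed on exposures, mediators, and covariates) can be illustrated with a deliberately simplified sketch. It swaps in ordinary least squares and a product-of-coefficients score for the semi-parametric regressions, stochastic interventions, and targeted learning used in this dissertation; the simulated data and the helper names `ols` and `path_scores` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares slopes of y on X (an intercept is fitted, then dropped)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[1:]

def path_scores(A, M, W, y):
    """Score each exposure-mediator pair by (exposure -> mediator slope)
    times (mediator -> outcome slope), adjusting for W in both regressions."""
    p, q = A.shape[1], M.shape[1]
    # Column k holds the slopes of mediator k on each exposure.
    a = np.column_stack([ols(np.column_stack([A, W]), M[:, k])[:p]
                         for k in range(q)])
    # Slopes of the outcome on each mediator, given exposures and W.
    b = ols(np.column_stack([A, M, W]), y)[p:p + q]
    return a * b  # shape (p, q)

# Simulated data with one true pathway: exposure 0 -> mediator 0 -> outcome.
rng = np.random.default_rng(3)
n = 4000
W = rng.normal(size=n)
A = rng.normal(size=(n, 3))
M = np.column_stack([0.8 * A[:, 0] + W + rng.normal(size=n),
                     rng.normal(size=n)])
y = 1.0 * M[:, 0] + 0.5 * A[:, 1] + W + rng.normal(size=n)

# Identify the strongest pathway in one half, re-estimate it in the other.
first = np.arange(n) < n // 2
scores = path_scores(A[first], M[first], W[first], y[first])
j, k = (int(i) for i in np.unravel_index(np.abs(scores).argmax(), scores.shape))
indirect = path_scores(A[~first], M[~first], W[~first], y[~first])[j, k]
print(f"strongest path: exposure {j} -> mediator {k}, indirect slope {indirect:.2f}")
```

The product-of-coefficients score is only meaningful under linear models without exposure-mediator interaction, which is exactly the kind of restriction the flexible estimators in Chapter 3 are meant to avoid.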
Chapter 4 discusses the importance of well-maintained open-source software that makes new methodologies available and reproducible for analysts. We discuss the two software packages that house the proposed methods. The first, CVtreeMLE, which stands for cross-validated decision trees with targeted maximum likelihood estimation, makes the statistical causal inference parameters in Chapter 1 available to researchers. The second, SuperNOVA, which stands for Analysis of Variance using Super Learner, makes the statistical causal inference parameters in Chapter 2 and Chapter 3 available. Chapter 5 concludes with a discussion of the future of statistical research using data-adaptive target parameters.