DNA methylation (DNAm) is commonly used to develop aging biomarkers such as predictors of age, mortality risk, and blood cell counts. Challenges arise from the high dimensionality of DNAm data, variability in cytosine-phosphate-guanine (CpG) locus coverage across arrays, and the difficulty of measuring DNAm in the most relevant tissue. This dissertation introduces novel approaches to harness DNAm data for biomarker development, improved imputation accuracy, and cross-tissue prediction through three interconnected studies.
Because DNAm data are often high-dimensional, constructing practical prediction models requires regularized regression frameworks. In the first arm, I developed DNAm-based biomarkers for fitness parameters, such as maximum handgrip strength and VO2max, using regularized linear regression. These biomarkers show significant associations with physical activity across diverse groups, from individuals with low to intermediate activity levels to elite athletes, showcasing their potential for evaluating the epigenetic impacts of physical fitness.
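To illustrate how a regularized linear model selects a sparse set of predictors, the following is a minimal sketch of lasso regression fitted by coordinate descent. The L1 penalty, the function names, and the toy data are assumptions made for illustration only; they are not the penalty, software, or data used in this dissertation.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    Assumes the columns of X and the outcome y are centered (no intercept).
    X is a list of rows; returns the coefficient list b.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (all-zero) column carries no signal
            # Covariance of column j with the residual that excludes column j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Hypothetical toy data: three centered "CpG" columns, outcome driven by the first only
col0 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
col1 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
col2 = [1.0, 1.0, -1.0, -1.0, -2.0, 2.0]
X = [list(row) for row in zip(col0, col1, col2)]
y = [2.0 * v for v in col0]

beta = lasso_coordinate_descent(X, y, lam=0.1)
```

On this toy example the penalty sets the two uninformative coefficients exactly to zero and shrinks the informative one slightly below its true value of 2, which is the behavior that makes such models practical when predictors far outnumber samples.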
Missing values are a common challenge in DNAm data, and my second arm presents tools that use copula models to improve imputation accuracy. Unmeasured loci become problematic when a DNAm biomarker's algorithm requires their methylation levels; however, DNAm data do not commonly meet the normality assumption that underlies imputation tools. Therefore, we developed algorithms that improve DNAm imputation by transforming DNAm values into Gaussian variables using their inherent distribution. While designed with DNAm in mind, our algorithms extend to any continuous variable that needs Gaussian structure, offering a versatile tool beyond DNAm research.
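The idea of mapping a non-normal variable to a Gaussian scale, as is done for the marginals of a Gaussian copula, can be sketched as follows. This sketch uses the empirical CDF (a rank-based normal-score transform) as a stand-in for the marginal distribution; it is an assumption for illustration and not the algorithm developed in this dissertation. The function names and the example values are hypothetical, and ties are not handled specially.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def to_normal_scores(values):
    """Map each value to a standard-normal score through its empirical CDF."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def from_normal_score(z, reference):
    """Map a normal score back to the data scale by interpolating reference quantiles."""
    ref = sorted(reference)
    n = len(ref)
    # Invert the rank / (n + 1) convention to get a zero-based fractional index
    pos = STD_NORMAL.cdf(z) * (n + 1) - 1
    pos = min(max(pos, 0.0), n - 1.0)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return ref[lo] + frac * (ref[hi] - ref[lo])


# Hypothetical methylation proportions at one CpG locus (bounded in [0, 1], skewed)
beta_values = [0.02, 0.05, 0.10, 0.50, 0.90, 0.97, 0.99]

z_scores = to_normal_scores(beta_values)
recovered = [from_normal_score(z, beta_values) for z in z_scores]
```

An imputation model that assumes normality would operate on `z_scores`, and imputed scores would then be mapped back to the methylation scale with `from_normal_score`, so that imputed values stay within the observed range.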
The final arm explores transfer learning (TL) methodologies to facilitate the prediction of DNAm biomarkers across tissues, addressing the limitation of tissue accessibility in biomarker development and measurement. By enabling the use of saliva DNAm to predict blood DNAm biomarker values, this approach significantly broadens the scope of non-invasive epigenetic studies, providing researchers with robust algorithms for cross-tissue biomarker prediction and tools for developing new biomarkers. In doing so, we demonstrate how DNAm information from other tissues can enhance biomarker prediction, and we provide guidelines for researchers to implement our TL methods.
Collectively, this dissertation presents novel strategies for extracting valuable insights from high-dimensional DNAm data, contributing new biomarkers for physical fitness, enhancing DNAm imputation methods, and pioneering cross-tissue prediction algorithms. These contributions advance the integration of epigenetics and biostatistics, facilitating a deeper understanding of DNAm and its implications for human health and aging.