eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Modeling Site-Site Dependency in DNA Methylation Sequencing data

Abstract

DNA methylation is a crucial epigenetic modification at CpG sites that influences gene expression and cellular function. Conventional analyses often neglect the intrinsic dependencies between adjacent CpG sites, which limits insight into the underlying biological mechanisms and constrains the broader applicability of those analyses. This thesis aims to model the site-site dependency in DNA methylation sequencing data using two complementary methodologies: a statistical approach using heterogeneous Hidden Markov Models (HMMs) and a machine learning approach employing Bidirectional Long Short-Term Memory (BiLSTM) networks.

The heterogeneous HMM extends the classical homogeneous HMM by incorporating genomic distance into the transition probabilities, reflecting the biological intuition that adjacent CpG sites lying closer together exhibit stronger dependencies. A parameter estimation procedure based on the Expectation-Maximization algorithm is derived to handle this extension. Simulation studies demonstrate that the heterogeneous HMM outperforms the homogeneous HMM in model fit, parameter estimation accuracy, and the capture of distance-related dependency patterns. When applied to whole-genome bisulfite sequencing (WGBS) data, the heterogeneous HMM provides a more accurate representation of methylation patterns, effectively capturing the diminishing dependency as genomic distance increases.
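
The idea of a distance-dependent transition probability can be illustrated with a minimal sketch. The abstract does not give the thesis's actual parameterization, so everything specific here is an assumption made for illustration: the exponential decay form, the function names `transition_matrix` and `forward_loglik`, the parameters `rate`, `stationary`, and `emit_meth`, and the use of binary methylation calls as observations. The sketch only shows how a transition matrix might move from the identity (at distance 0) toward independence (at large distance), and how such a matrix could be used in a scaled forward pass.

```python
import math


def transition_matrix(distance, rate, stationary):
    """Hypothetical distance-dependent transition matrix (an assumption,
    not the thesis's formula).

    P[i][j](d) = stationary[j] + (1{i == j} - stationary[j]) * exp(-rate * d)

    At d = 0 this is the identity; as d grows, every row approaches the
    stationary distribution, i.e. neighbouring sites become independent.
    """
    decay = math.exp(-rate * distance)
    n = len(stationary)
    return [
        [stationary[j] + ((1.0 if i == j else 0.0) - stationary[j]) * decay
         for j in range(n)]
        for i in range(n)
    ]


def forward_loglik(calls, positions, rate, stationary, emit_meth):
    """Log-likelihood of binary methylation calls via the scaled forward pass.

    calls      : list of 0/1 calls, one per CpG site (illustrative data type)
    positions  : sorted genomic coordinates of the same sites
    emit_meth  : emit_meth[s] = P(call == 1 | hidden state s)
    """
    n = len(stationary)

    def emit(state, call):
        return emit_meth[state] if call == 1 else 1.0 - emit_meth[state]

    # Initialise with the stationary distribution at the first site.
    alpha = [stationary[s] * emit(s, calls[0]) for s in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]

    for t in range(1, len(calls)):
        # The transition matrix is rebuilt at every step from the gap
        # between consecutive sites; this is what makes the chain
        # heterogeneous rather than homogeneous.
        P = transition_matrix(positions[t] - positions[t - 1], rate, stationary)
        alpha = [
            sum(alpha[i] * P[i][j] for i in range(n)) * emit(j, calls[t])
            for j in range(n)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]

    return loglik
```

Under this assumed form, setting `rate` to a very large value recovers a model with independent sites, while a homogeneous HMM corresponds to using the same matrix for every gap. In an EM procedure, a quantity like `rate` would be updated in the M-step alongside the emission parameters; that step is not shown here.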

To address the limitations of HMMs in capturing complex and long-range dependencies, the thesis also introduces a deep-learning approach using BiLSTM networks. This model leverages the recurrent neural network architecture to implicitly learn sequential dependencies in both forward and backward directions. By incorporating a rich set of features, including methylation levels, genomic distances, and sequence context embeddings, the BiLSTM simultaneously captures marginal methylation probabilities and preserves site-site dependencies. Simulation studies and WGBS data analyses demonstrate that it outperforms both the homogeneous and heterogeneous HMMs in matching marginal methylation levels and in preserving the intricate dependency patterns.

These two complementary approaches enhance the ability to characterize the methylation pattern dynamics observed in real data. The heterogeneous HMM offers interpretability in modeling how dependency varies with genomic distance, while the BiLSTM provides the flexibility to incorporate various features and capture complex dependency patterns. The thesis also outlines future directions to enhance these methodologies, including applying the frameworks to diverse datasets for broader generalizability, improving the computational scalability of the heterogeneous HMM for large-scale datasets, leveraging explainable machine learning techniques to identify key features driving methylation concordance, and exploring advanced generative models such as transformers and diffusion models for methylation pattern modeling. By effectively capturing site-site dependencies, these methods show promise for practical applications, such as imputing missing values in sparse datasets and improving the detection of differentially methylated regions, ultimately advancing the biological understanding and translational potential of epigenetics.
