Pan, Zhicheng

Machine Learning Strategies for Alternative Splicing

2021

Advisor(s): Xing, Yi

Abstract

Alternative splicing (AS) is a fundamental biological process that diversifies transcriptomes and proteomes. Aberrant splicing is a major contributor to rare diseases and cancers. Our understanding of AS is far from complete, which limits our comprehension of the phenotypic effects of splicing dysregulation. Recent advances in next-generation sequencing (NGS) technologies have revolutionized the discovery of AS, and considerable effort has gone into generating large compendia of RNA sequencing (RNA-seq) datasets. These datasets offer an opportunity to study the regulation of AS across tissues, cell stages, and perturbed biological conditions at unprecedented resolution and scale. However, using this large number of datasets to make biological discoveries remains a challenge. In this dissertation, we developed machine-learning-based strategies to integrate various types of RNA-seq datasets and transform them into biological knowledge, thereby enabling discoveries about the regulatory mechanisms and functional consequences of AS.
In the first part of the dissertation, we report a deep-learning-based computational framework, Deep-learning Augmented RNA-seq analysis of Transcript Splicing (DARTS), that uses Bayesian integration of deep-learning-based predictions with empirical RNA-seq data to infer differential alternative splicing between biological samples. RNA-seq analysis of alternative splicing is largely limited by its dependence on high sequencing coverage. Through deep learning, DARTS transforms large amounts of publicly available RNA-seq data into knowledge of how splicing is regulated, thus enabling researchers to better characterize alternative splicing events that are inaccessible from RNA-seq datasets with modest coverage.
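The idea of combining a model-derived prior with observed read counts can be illustrated with a minimal sketch. This is not the DARTS implementation: the binomial read model, the uniform grid over inclusion levels, the 0.05 difference cutoff, and every function name below are assumptions chosen for illustration only.

```python
import math


def log_binom_pmf(k, n, p):
    """Log probability of k inclusion reads out of n total at inclusion level p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def marginal_likelihoods(inc1, tot1, inc2, tot2, cutoff=0.05, grid=99):
    """Average the read likelihood over grid pairs of inclusion levels.

    Pairs whose levels differ by more than `cutoff` count as "differential",
    the remaining pairs as "unchanged" (an illustrative assumption).
    """
    levels = [(i + 1) / (grid + 1) for i in range(grid)]
    diff, same = [], []
    for p1 in levels:
        l1 = log_binom_pmf(inc1, tot1, p1)
        for p2 in levels:
            lik = math.exp(l1 + log_binom_pmf(inc2, tot2, p2))
            (diff if abs(p1 - p2) > cutoff else same).append(lik)
    return sum(diff) / len(diff), sum(same) / len(same)


def posterior_differential(prior, inc1, tot1, inc2, tot2, cutoff=0.05):
    """Posterior probability of differential splicing via Bayes' rule.

    `prior` stands in for a probability predicted by a trained model.
    """
    m_diff, m_same = marginal_likelihoods(inc1, tot1, inc2, tot2, cutoff)
    numerator = prior * m_diff
    return numerator / (numerator + (1.0 - prior) * m_same)


# Low coverage: the prior largely determines the conclusion.
low_with_strong_prior = posterior_differential(0.9, 3, 5, 2, 5)
low_with_weak_prior = posterior_differential(0.1, 3, 5, 2, 5)
# High coverage: the observed reads dominate even a skeptical prior.
high_with_weak_prior = posterior_differential(0.1, 90, 100, 10, 100)
```

In this toy setting, the prior matters most when read counts are small, which is the modest-coverage regime the abstract describes; with deep coverage the observed reads dominate.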
In the second part of the dissertation, we present a computational tool, Systematic Investigation of Retained Introns (SIRI), to quantify unspliced introns, and describe a deep-learning-based computational framework to investigate the sequence preferences of different intron groups across subcellular locations. The steps of mRNA maturation occur in distinct cellular locations, yet the subcellular distribution of processed and unprocessed transcripts is often missing from transcriptomic analyses. We employed SIRI to measure intron levels in subcellular locations across cell development and identified four intron groups with disparate patterns of RNA enrichment across subcellular locations. Through the deep-learning-based framework, we identified a set of triplet motifs and sequence conservation patterns that are predictive of intron behavior across biological conditions.
In the third part of the dissertation, we present a deep-learning-based, tissue-specific framework, individualized Deep-learning Analysis of RNA Transcript Splicing (iDARTS), for predicting splicing levels. The rapid accumulation of RNA-seq datasets matched with whole-exome or whole-genome sequencing has yielded an enormous number of variants underlying diseases, traits, and cancer. Interpreting the functional consequences of these variants remains a challenge in disease diagnostics and precision medicine. iDARTS leverages publicly available RNA-seq datasets to model the cis RNA sequence features and trans RNA-binding protein levels that determine AS, allowing precise prediction of splice-altering genetic variants. Analyzing ~10 million intronic and exonic variants with iDARTS, we demonstrated that predicted splice-altering variants are functionally relevant and related to cancer development. Applying iDARTS to interpret the functional consequences of variants of uncertain significance in clinical studies, we found that predicted splice-altering variants are enriched tenfold in pathogenic categories relative to benign categories. Our results indicate that iDARTS will benefit large-scale screening of disease-implicated variants, thus improving disease diagnosis and enabling discoveries for precision medicine.
In the fourth part of the dissertation, we study the mechanisms underlying N6-methyladenosine (m6A) RNA modification by investigating the biological consequences of arginine methylation of METTL14 through transcriptome-wide profiling of m6A. Arginine methylation of METTL14 controls m6A deposition in mammalian cells. Mouse embryonic stem cells (mESCs) expressing arginine methylation-deficient METTL14 exhibit significantly reduced global m6A levels. The arginine methylation-dependent m6A sites identified from transcriptome-wide analysis are associated with enhanced translation of genes essential for the repair of DNA interstrand crosslinks. Collectively, these findings reveal important aspects of m6A regulation and new functions of arginine methylation in RNA metabolism.


UCLA
