"In recent years, High Throughput Sequencing (HTS) has become a revolutionary tool in biomedical research. It has impacted our knowledge of basic machineries of the cell and affected human health and medicine. The ever-growing amount of data generated by this technology has necessitated the development of novel algorithms and computational approaches that can efficiently process this data. Among other influences in biology and medicine, the alliance of High Throughput Sequencing data and novel computational methods have changed our understanding of RNA and its role in cell development and human diseases. These methods benefit from a wide range of mathematical, statistical, probabilistic techniques to help study the problems in RNA biology. Handling the noise and biases of sequencing data, identifying unknown RNAs that play a role in different mechanisms, modeling RNA regulation and modification and capturing the features involved in these processes are all problems that can be addressed by integrating computational approaches into RNA biology. In this thesis, we focus on developing such computational approaches and models to provide efficient but comprehensive tools and sources for solving some of the main problems in RNA biology.
First, we apply Gaussian Processes to the temporal profiles of RNAs, generated from HTS data, to assemble a comprehensive catalogue of long non-coding RNAs (lncRNAs) present in the early stages of embryo development. These lncRNAs are likely to play a role in cell differentiation. Second, to address the effect of aberrant RNA processing on health and cell development, we curate a database of exons conserved in both sequence and length through. We demonstrate the potential of ultra-conservation as a better predictor of alternative splicing of RNA, and we show its application in studying mutations that affect splicing and lead to disease. Third, we introduce PASARNA, a tool for RNA polyadenylation analysis. This tool uses mathematical and probabilistic change point detection techniques to find polyA sites from RNA-Seq data. It also takes advantage of information extracted from the sequence by a Convolutional Neural Network. Finally, we describe our probabilistic model of the “CHRYON” technology (Cell HistorY Recording by Ordered iNsertion), which is designed to record non-genetic information, such as lineage or environmental stress, into a genomic sequence using a self-targeting RNA. Using this model, we assess the potential of CHRYON and study its characteristics."