Multi-scale analysis of sequence and regulatory information in Escherichia coli
- Lamoureux, Cameron
- Advisor(s): Palsson, Bernhard
Abstract
Biological information is encoded and transmitted by nucleic acids. Next-generation sequencing technologies have unleashed a flood of large-scale genomics and transcriptomics data capturing this information flow. Here, we develop three analytical frameworks for deriving biological knowledge from these data at multiple scales, using Escherichia coli as a model. First, we introduce the Bitome, a single-base-pair resolution representation of the annotation information for a genome sequence. This binarized construct highlights the uneven patterning of genomic information. Moreover, we leverage this representation to classify genes based on adaptive mutability and to quantitatively predict mRNA transcript levels from promoter sequence. Next, we analyze sequence variation in non-coding regions across 2,350 E. coli strains. We demonstrate that annotated functional non-coding features are significantly conserved. We also show that non-coding alleles are sufficient to segment phylogroups, and we contrast adaptive mutations with wild-type variation. Finally, we construct a high-precision, single-protocol 1,035-sample RNA-seq compendium called PRECISE-1K. Using unsupervised machine learning, we extract 201 independently modulated groups of genes (iModulons) that capture the majority of the known transcriptional regulatory network. iModulons also reveal novel regulons and uncover a binding-site basis for different functional behavior within the same regulon. In combination, this expression and regulatory information constitutes a knowledge base that may be applied to the analysis of new data. As a whole, this work introduces a multi-scale suite of analytical tools that enable the study of information flow by converting big data to biological knowledge.