Our world is becoming quantifiable. The IBM Corporation estimates that our society is collecting data at a rate of 2.5 quintillion bytes per day. To give some perspective, 90% of the data that humans now have access to has been collected in the last two years. Biological science is no exception. This information holds enormous potential, but the biggest challenge now lies in data analysis and interpretation.
In biology, this data revolution has been led by sequencing technology, so most data are either genomic or transcriptomic in nature. This dissertation focuses instead on protein mass spectrometry. We find that integrating multiple data sets yields the most powerful systems-level descriptions of biological systems. In this dissertation, we show how proteomic data can be integrated with both transcriptomic and epigenomic data sets to provide critical insight into these systems.
In the first chapter, we show that proteomic and transcriptomic measurements differ fundamentally and lead to different specific results but similar “big-picture” conclusions. We use both to reconstruct gene regulatory networks and find that the most accurate network results from integrating both data types.
In the second chapter, we expand on observations made in chapter one and incorporate DNA methylation data. We discover that, using random forest machine learning models and genic DNA methylation data, we are able to classify the subset of expressed mRNAs with high accuracy. Most interestingly, after incorporating proteomic data, we achieve near-perfect classification accuracy and go on to discover a surprising association between genic DNA methylation and translation. Such models can be used to annotate the functional subset of maize genes with accuracy equal to or better than that of current manual annotations. These models show excellent accuracy in a diverse set of maize inbreds, leading to speculation that DNA methylation plays a large role in crop domestication.
In the final chapter, we use a novel method to integrate protein and transcript data to discover quantitative trait loci that specifically control protein abundance in an mRNA-independent manner in Arabidopsis. We then demonstrate how transcript data can be used to prioritize causative genes.