Penalization and sparse model selection have become topics of intense research interest in the era of big data, made newly available by ubiquitous computing power, advancing data collection technologies, and internet connectivity. In genomics, chromatin immunoprecipitation, microarrays, and next generation sequencing have made available a wealth of information that continues to accumulate and that we have only begun to understand and fully utilize. We propose two penalized Bayesian techniques: one that selects, from a large library, a sparse set of DNA binding factors (DBFs) whose binding to the genome is enriched in a set of regions of interest and that predicts joint binding landscapes for the selected DBFs, and another that predicts gene expression from joint binding landscapes.
Cellular processes are controlled, directly or indirectly, by the binding of hundreds of different DBFs to the genome. One key to a deeper understanding of the cell is discovering where, when, and how strongly these DBFs bind to the DNA sequence. Direct measurement of DBF binding sites (e.g., through ChIP-Chip or ChIP-Seq experiments) is expensive, noisy, and not available for every DBF in every cell type. Naive approaches and most existing computational approaches to detecting which DBFs bind in a set of genomic regions of interest often perform poorly because of high false discovery rates and restrictive requirements for prior knowledge.
We develop SparScape, a penalized iterative sampling Bayesian method for identifying the DBFs active in the considered regions and predicting a joint probabilistic binding landscape. Using a sparsity penalization, SparScape selects a small subset of DBFs with enriched binding sites in a set of DNA sequences from a much larger candidate set. This substantially reduces false positives in the prediction of binding sites. Analyses of ChIP-Seq data in mouse embryonic stem cells (ESCs) and of simulated data show that SparScape dramatically outperforms the naive motif scanning method and comparable computational approaches in terms of DBF identification and binding site prediction.
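For context, the naive motif scanning baseline mentioned above can be sketched as a sliding-window log-odds scan. This is only an illustrative sketch, not SparScape itself and not necessarily the exact baseline used in the comparison: the function names, the toy position weight matrix, the uniform background, and the score threshold are all assumptions introduced here.

```python
import math

def log_odds_matrix(pwm, background=0.25):
    """Convert per-position base probabilities to log2-odds scores.

    Assumes a uniform background and strictly positive probabilities
    (i.e., pseudocounts have already been applied to the matrix).
    """
    return [{b: math.log2(col[b] / background) for b in "ACGT"} for col in pwm]

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    lod = log_odds_matrix(pwm)
    width = len(lod)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(lod[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

def toy_column(consensus):
    """Hypothetical matrix column: 0.85 for the consensus base, 0.05 otherwise."""
    return {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}

# Hypothetical 3-bp motif with consensus "TGA".
TOY_PWM = [toy_column(b) for b in "TGA"]

if __name__ == "__main__":
    for start, score in scan("ACTGATTGAC", TOY_PWM, threshold=5.0):
        print(start, round(score, 2))
```

Because each DBF is scanned independently against a fixed threshold, scanning a large library of short motifs in this way reports many chance matches, which is the source of the false positives that the sparsity penalization is meant to reduce.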
We also propose an extension of Bayesian treed regression to predict gene expression from joint binding landscapes. Rather than sampling from the space of possible partitioning trees, we follow a broad optimization approach, forking the growing partitioning tree at a node whenever multiple candidate splits yield similar values of the given objective function. After growing the tree, we select variables at each leaf node of each forked partitioning tree, take the union of these selected variables and the splitting variables at each internal node, and then re-grow the partitioning tree considering only the selected variables.
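The forking step can be illustrated with a minimal sketch that, at a single node, keeps every candidate split whose objective value is close to the best one. The objective used here (the summed within-child squared error), the absolute tolerance `tol`, and all function names are assumptions made for illustration; the abstract does not specify the objective function or the similarity criterion.

```python
from typing import List, Sequence, Tuple

def sse(values: Sequence[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def candidate_splits(
    X: Sequence[Sequence[float]], y: Sequence[float]
) -> List[Tuple[float, int, float]]:
    """Return (objective, variable index, threshold) for every possible split."""
    results = []
    for j in range(len(X[0])):
        levels = sorted({row[j] for row in X})
        # Thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            results.append((sse(left) + sse(right), j, t))
    return results

def forked_splits(
    X: Sequence[Sequence[float]], y: Sequence[float], tol: float
) -> List[Tuple[int, float]]:
    """Keep every (variable, threshold) whose objective is within tol of the best."""
    splits = candidate_splits(X, y)
    if not splits:
        return []
    best = min(obj for obj, _, _ in splits)
    return sorted((j, t) for obj, j, t in splits if obj <= best + tol)

if __name__ == "__main__":
    # Toy data: variables 0 and 1 are identical, so their best splits tie.
    X = [[0.0, 0.0, 5.0], [0.0, 0.0, 1.0], [1.0, 1.0, 4.0], [1.0, 1.0, 2.0]]
    y = [1.0, 1.0, 3.0, 3.0]
    print(forked_splits(X, y, tol=0.1))
```

In a full implementation, each retained split would start its own copy of the partitioning tree, and the splitting variables from all copies would feed the union used when the tree is re-grown.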