High-throughput sequencing (HTS) can identify unique DNA sequences and quantify their abundances in mixed DNA pools. HTS-based assays can therefore profile complex biological or chemical systems whose entities can be converted to unique DNA sequences. Computational models have also been developed to analyze HTS data at scale. However, such data pose distinctive analytical challenges, including discrete counts, relative rather than absolute measurements, and small sample sizes. Careful assessment of these computational tools is required for robust interpretation of results.
In this dissertation, we investigated these computational challenges and proposed and assessed solutions for two applications of HTS-based assays. In the first work, we proposed k-Seq, a kinetic assay to measure the activity of self-aminoacylating ribozymes (catalytic RNA). Characterizing the kinetics of different molecules in a heterogeneous pool is challenging because their abundances and activities can vary over several orders of magnitude. We explored different experimental designs and identified critical factors affecting the estimation of kinetic coefficients in the pseudo-first-order kinetic model for these ribozymes. Using bootstrapping, we quantified the estimation uncertainty for individual sequences and determined the minimum sequencing counts required for reliable estimation. Combining the improved experimental design with the new analytical tools, we robustly quantified the kinetics of 10^5 different ribozymes.
In the second work, we constructed correlation networks among microorganisms from metagenomic data and studied the structure of the human skin microbiome in patients with chronic wounds. We designed a variation of Gaussian graphical models to capture the direct correlations between the abundances of bacteria and viruses while accounting for the structure and limitations of the data. To minimize the discovery of false correlations from the small, noisy dataset, we applied a two-step model selection procedure to regularize the results. Lastly, we demonstrated the utility of the constructed correlation network in recovering strong correlations between microbes and in identifying potentially important microbes and microbial clusters.