Technology guides the practice of scientific inquiry. In the biological sciences, DNA sequencing has encouraged the commingling of traditional experimental biology and computer science. In this thesis, I describe computational and biochemical methods to advance nanopore DNA sequencing technology. Nanopore DNA sequencing is a single-molecule technology that shows promise in the areas of read length, instrument portability, and, as shown in this work, detection of chemical modifications. A nanopore sequencing device contains a nanometer-sized pore that separates two electrolyte buffer reservoirs. A voltage is applied across the nanopore, and the device records the ionic current through the pore. As DNA polymers translocate through the pore, they modulate the ionic current by partially obstructing the pore in a sequence-dependent manner. Commercial sequencing devices contain arrays of pores that are individually controlled.
The first chapter of this work provides technical background and a practical application of the technology. In the following chapters, I describe new algorithms that improve the utility of the technology. In Chapter 2, I describe a hidden Markov model (HMM) for the nanopore ionic current. I describe how multiple model topologies were implemented, one of which models the time domain. I then describe how a hierarchical Dirichlet process (HDP) can be used to model non-canonical (modified) bases, yielding an HMM-HDP model. In Chapter~\ref{chap:detecting_modifications}, I show how the model can be trained to detect multiple chemical modifications in synthetic DNA and genomic DNA samples. In the final section of that chapter, I show how the model was incorporated into a cloud-based pipeline that scales horizontally to accommodate large datasets.
The last section of this thesis describes a biochemical method to re-read DNA sequences using a nanopore and a helicase enzyme. Multiple biochemical techniques have been used to increase the accuracy of nanopore sequencing; re-reading the same strand places stochastic errors at different positions in each pass, so that the passes can be compared.