Motivation: Nanopore-based sequencing techniques can reconstruct properties of biosequences by analyzing the sequence-dependent ionic current steps produced as biomolecules pass through a pore. Typically, this involves aligning new data to a reference, where both reference construction and alignment have been performed by hand.
Results: We propose an automated method for aligning nanopore data to a reference using hidden Markov models. Several features that arise from prior processing steps and from the class of enzyme used can be incorporated into the model in a simple manner. Previously, the M2MspA nanopore was shown to be sensitive enough to distinguish between cytosine, methylcytosine, and hydroxymethylcytosine. We validated our automated methodology on a subset of those data by automatically calculating the error rate for distinguishing the three cytosine variants. The automated methodology produces a 2–3% error rate, lower than the 10% error rate from previous manual segmentation and alignment.
Availability: The data, output, scripts, and tutorials replicating the analysis are available at https://github.com/UCSCNanopore/Data/tree/master/Automation.