Single-molecule long-read sequencing technology has recently reached accuracies useful for studying diverse viral genes and genomes. Challenging error profiles, however, hinder the interpretability of long-read sequencing datasets. Here we develop computational tools for processing such datasets and for visualizing rapidly evolving viral populations.
Our primary biological focus is the HIV-1 envelope protein (Env), which is the sole target of neutralizing antibodies. An effective HIV-1 vaccine would be a powerful weapon against the current global epidemic, but progress has been slow because Env is a difficult target. Nevertheless, some hosts develop broadly neutralizing antibodies (bNAbs), which could be protective if they could be elicited by vaccination. Env and bNAb lineages co-evolve, so understanding Env populations and their evolutionary dynamics will likely be critical for understanding how to elicit the desired immune response. The tools developed in this dissertation allow, for the first time, accurate processing of full-length sequencing data from HIV-1 env populations. Computational challenges in analyzing these sequences include the length of the gene (2.6 kb) and the combination of prevalent indel sequencing errors with extensive biological indel variation, which renders traditional approaches inaccurate.
FLEA is a pipeline for processing circular consensus sequences and providing biological insights into the evolution of env. It cleans the input sequences, infers high-quality consensus sequences, and runs downstream analyses including codon alignment, phylogenetic tree inference, ancestor reconstruction, and selection inference. The FLEA pipeline supports multiple cluster and high-performance computing environments. A client-side web application provides interactive visualizations, including a tree viewer, a multiple sequence alignment (MSA) browser, and a three-dimensional structure viewer.
RIFRAF is a novel multi-objective sequence consensus algorithm. It incorporates per-base quality scores and uses a reference sequence for frame correction. RIFRAF consistently finds consensus sequences that are more accurate and more often in-frame than those from other methods, even with few reads and a distant reference. It is also uniquely capable of retaining true indels while removing spurious ones.
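To illustrate why per-base quality scores matter for consensus inference, the following minimal sketch converts Phred scores to error probabilities and uses them to weight a column-wise vote. It is not the RIFRAF algorithm, which is multi-objective and additionally uses a reference for frame correction. The function names are hypothetical, and the sketch assumes reads that are already aligned, of equal length, and free of gaps.

```python
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def weighted_consensus(reads, quals):
    """Quality-weighted vote over pre-aligned, equal-length, gap-free reads.

    Hypothetical helper for illustration only; not part of RIFRAF or FLEA.
    Each base call contributes a weight of (1 - error probability).
    """
    if not reads or len(reads) != len(quals):
        raise ValueError("need one quality list per read")
    length = len(reads[0])
    if any(len(r) != length for r in reads) or any(len(q) != length for q in quals):
        raise ValueError("reads and quality lists must all have the same length")

    consensus = []
    for col in range(length):
        scores = defaultdict(float)
        for read, qual in zip(reads, quals):
            # A confident call (high Q) counts for more than a doubtful one.
            scores[read[col]] += 1.0 - phred_to_error_prob(qual[col])
        consensus.append(max(scores, key=scores.get))
    return "".join(consensus)


reads = ["ACGT", "ACGA", "ACGA"]
quals = [[40, 40, 40, 40], [40, 40, 40, 2], [40, 40, 40, 2]]
result = weighted_consensus(reads, quals)
```

In this toy input, the last column has two low-quality `A` calls and one high-quality `T` call, so the weighted vote selects `T` where an unweighted majority vote would select `A`.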
These tools have been used to study donors from the Protocol C primary infection cohort, resulting in two high-profile journal articles and a third in preparation. They have also been used to analyze data from a phase I clinical trial of an anti-Env monoclonal antibody therapy, with results published in Nature Medicine. This dissertation reviews those articles, focusing on the results obtained with these tools.