Advances in high-throughput experimental technologies now allow for the probing of the fitness of up to millions of biological sequences. Parallel developments in machine learning (ML) techniques set the stage for a new class of ML-based computational approaches to biological sequence engineering. These approaches can largely be regarded as solving one of two related problems: forward and inverse problems. In a forward problem, the aim is to estimate a fitness function from high-throughput experimental data. In an inverse problem, the aim is to use a predictive model of fitness to design a sequence or library of sequences that satisfies a particular engineering goal. In this dissertation, we will discuss various aspects of both of these sets of problems, and propose solutions for certain scenarios.
First, we present a method for solving one instantiation of an inverse problem. This method, Conditioning by Adaptive Sampling (CbAS), is designed to optimize models of protein fitness when those models make biased predictions in regions of sequence space far from the set of training data. CbAS is an example of a broad class of optimization methods known as Estimation of Distribution Algorithms (EDAs). We show that many EDAs can be viewed as a form of Expectation-Maximization (EM), a popular inference algorithm in probabilistic ML.
Next, we explore the forward problem in more depth. It is an open problem to characterize how much data is needed for accurate estimation of fitness functions. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we study the sparsity of fitness functions sampled from a generalization of the NK model, a widely-used random field model for fitness functions. In particular, we present theoretical results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters---sequence length, alphabet size, and assumed interactions between sequence positions---on the sparsity of fitness functions sampled from the model, and use these results to determine the number of measurements required to exactly recover GNK fitness functions. Further, we show that GNK fitness functions with parameters set according to protein structural contacts accurately approximate the sparsity of empirical fitness functions and can be used to approximate the number of experimental measurements required to effectively estimate a protein fitness function.
Finally, we describe an end-to-end modeling approach for designing a library of insertion sequences to the Adeno-Associated Virus (AAV) capsid with the goal of improving the ability of the capsid to fold and package the viral genome. AAV is a promising vector for delivering gene therapies and packaging is a requirement for effective delivery to a target tissue; a library with improved packaging ability is more likely to infect these tissues in a downstream experiment. Our approach to library design requires solutions to both forward and inverse problems. In particular, we first fit a predictive model to a set of high-throughput experimental data that reports on packaging fitness. We then present a novel method to design libraries that optimally balance average fitness and sequence diversity, and apply this method to the design of AAV insertion sequence libraries.
In summary, this dissertation presents a number of ML-based techniques for biological sequence engineering, as well as theoretical insights that can guide future methodological developments. This work provides further motivation for the biological community to adopt machine learning as a tool to be used alongside experiments in order to effectively engineer sequences.