Making Sense of Protein Representation Learning in the Wild
- Bhattacharya, Nicholas
- Advisor(s): Song, Yun S.; Evans, Steven N.
Abstract
Data-driven systems used to inform the design or characterization of proteins in the wet lab typically face two challenges. First, relevant labeled data are scarce due to cost, lack of public availability, or the novelty of the experiment in question. Second, each round of experimentation requires a significant investment of resources and labor, putting pressure on developers of data-driven systems to achieve reasonable results with very little data. The paradigm of transfer learning via self-supervised pretraining offers a way to address both of these challenges. The key insight behind this approach is that while labeled data are rare, unlabeled protein sequence data are abundant and easily downloaded. Self-supervised pretraining allows large-scale neural networks to extract a library of patterns from these unlabeled data and potentially apply those patterns to small labeled datasets. In this work, we explore the self-supervised pretraining approach from a number of angles and develop a broader picture of its efficacy. We first develop a benchmark that allows for informative comparison of pretrained models. Next, we demonstrate mathematical and empirical connections between the transformer, the predominant architecture for pretrained neural networks, and established protein-modeling techniques based on Potts models. Lastly, we introduce a novel experiment that measures antibody-antigen binding for various antibodies against the model antigen Ovalbumin; the use of hundreds of alanine variants of Ovalbumin yields higher resolution than existing approaches. We develop a novel clustering algorithm to analyze these data and test the ability of pretrained models to predict the experimentally measured antibody-antigen binding.