Biology at scale: towards angstrom-level resolution across tens of thousands of bacteria strains
- Catoiu, Edward Alexander
- Advisor(s): Palsson, Bernhard Ø
Abstract
With the growing number of publicly available bacterial genome sequences, the ability to resolve the shapes of increasingly complex proteins through advancements in microscopy and protein-folding prediction software, and the mechanistic insight provided by genome-scale models (GEMs), microbial biology is rapidly entering the digital age. Tens of thousands of whole-genome sequences can now be simultaneously analyzed, and a full accounting of the gene content and genetic variation within a bacterial species can be assessed. A sequence variant (mutation) can be mapped onto the 3D protein structure, where its proximity to enzymatic domains or structural features may offer a physicochemical basis for its observed effect. Structures that reflect the multi-subunit nature of a protein can be incorporated into genome-scale models to obtain an angstrom-level understanding of whole-cell functions. Undoubtedly, interoperable workflows that offer angstrom-level resolution across a scale that spans thousands of genomes will usher in a new generation of analytical tools, with implications for evolutionary, structural, and systems biology. In this dissertation, I build such workflows and present findings that can only be revealed at this new scale of biological data. First, I quantify the sequence variation in Escherichia coli by defining its “alleleome” – the collection of all alleles of all genes found in the whole-genome sequences of 2,661 wild-type strains – and find extensive differences between wild-type and laboratory-evolved strains. Second, I generate the Quaternary Structural Proteome Atlas of a Cell (QSPACE) – an oligomeric structural representation of the cellular proteome – for E. coli and use interoperable residue-level data (e.g., mutations, functional domains, subcellular compartments) to analyze sequence variants and to generate a draft image of an optimal cell. Third, I generate alleleomes for 184 bacterial species (from 54,191 strains) and reveal characteristics of the evolutionary history of modern-day bacteria. Taken together, this dissertation describes foundational interoperable workflows that vastly expand the scale and resolution at which microbes can now be studied.