As a Ph.D. candidate at the University of California, Berkeley, working in the lab of Dr. Jillian Banfield, I have dedicated my Ph.D. to the discovery and investigation of novel genes, proteins, and microbial taxa, as well as their relationship to ecologically relevant metabolic processes, through a combination of established and novel bioinformatic approaches, including the development of several new tools and workflows.
The first chapter of this thesis investigates the phylogenetic structure and metabolic capacities of various clades of the phylum Chloroflexota and its neighboring taxa, collectively named the Chloroflexi supergroup. Notably, this analysis revealed numerous novel versions of well-characterized proteins with distinct biochemical capacity, as confirmed by subsequent analyses by other groups utilizing these data. These proteins include RuBisCO, a key enzyme in the Calvin-Benson-Bassham cycle, the predominant pathway of CO2 fixation on Earth; a novel variant of type 3 NiFe hydrogenase; and a fused photosystem reaction center protein unique to the phylum Chloroflexota, for which we report an expanded distribution. This work was published on bioRxiv and is in preparation for resubmission.
The second chapter describes, through a genome-resolved metatranscriptomics approach, the actively expressed metabolic functions of soil microbial communities in a montane hillslope in the East River watershed SFA, Crested Butte, CO, USA. Notably, we find evidence that microbial communities in the soil are stratified into distinct ecosystem subtypes: the shallow hillslope soil, the subsurface below 2 m depth, and perennially submerged soils near a river. Key taxa are indicative of each ecosystem subtype. The shallow hillslope soil, from a sampling depth of 5 to 80 cm, is characterized by high rates of mRNA expression by Thaumarchaeota, which, in spite of their low genomic abundance, are the most highly expressing organisms in these samples. The subsurface samples, below 80 cm depth, are characterized by methane- and alkane-oxidizing Actinobacteriota, as well as archaeal phyla, such as Thermoproteota, that are not found elsewhere in the hillslope. The perennially submerged soils, located proximal to the East River, show the highest rate of transcription for genes involved in fixation of important gases such as carbon dioxide and dinitrogen, and show high taxonomic similarity to samples taken from river meander soils at this site despite being much higher in elevation. This work has been submitted to Microbiome for review.
The third chapter examines a set of enigmatic predicted giant protein sequences found in genomes of the candidate phylum Omnitrophota, which regularly exceed 30,000 amino acids in length. These sequences contain numerous indications of a potential function in bacterial predation, consistent with previous observations that Omnitrophota containing these proteins prey upon Archaea. Through in silico protein structural analysis, we determined that previously unannotated regions of these giant protein sequences contain not only identifiable enzyme-like structures, but also confidently predicted novel folds stretching across thousands of amino acids. Additionally, we developed a novel workflow that uses existing tools to predict the folds along such regions and investigate their structure, an analysis that is not possible with current approaches. We also present curated, circular genomes for organisms of this candidate phylum, and perform a thorough review of the features present on similarly large proteins from other phyla, which are often also predicted to play a role in either symbiosis or predation. This work is available on bioRxiv and will soon be submitted to mBio for review.
The fourth chapter presents a new tool, Kuma, for high-throughput annotation of metagenomic protein sequences and retrieval of those sequences for further analysis, along with an example use case showing its utility in analyzing hundreds of metagenomes at once. Kuma provides the user with the ability to download numerous published custom HMM databases, gathered in one place for the first time, and to easily search sequence databases with them from the command line. We demonstrate the utility of this tool by using it to profile hundreds of metagenomes with HMM databases such as PFAM and KOFAM. These metagenomes were then analyzed through novel statistical approaches to determine which features were most enriched in particular ecosystem subtypes. This analysis revealed that taxon-specific domains, as well as domains of unknown function, were often the strongest indicator features for particular ecosystem subtypes. This work is in preparation for submission to Bioinformatics.
Overall, this dissertation describes the use of a combination of bioinformatic approaches to shed light on unexplored and enigmatic data obtained from metagenomic and metatranscriptomic samples and from across the tree of life. This work demonstrates the importance of such data in the pursuit of understanding the various ecosystem processes to which microorganisms contribute in soils, groundwater, and contaminated environments. Crucially, such metagenomic datasets are becoming increasingly abundant, necessitating both the development of high-throughput analysis methods and in-depth investigations into the novel features and data contained within them.