Biological diversity can be defined as the total variation of life across levels of biological organization, from genes and cells to communities and ecosystems. Exploiting the observed diversity can be of vital interest for environmental or clinical applications, as it may translate into improved responses in community management or patient treatment. Advancements in biological data acquisition technologies such as next-generation sequencing, tandem mass spectrometry or cell imaging have enabled scientists to explore diversity in complex samples. The high volume of data, however, has created the need for efficient and sensitive computational techniques to perform useful analyses. In this dissertation, I present three studies in which we explore the presence and the level of biological diversity, together with the computational tools and analyses developed for three different data modalities.
First, I describe our computational analysis of bacterial small subunit rRNA (16S) and eukaryotic internal transcribed spacer 2 (ITS2) sequencing data from industrial-scale open algae ponds, where we explored the associations between community composition and ecosystem variables over a year. We found that periods of high eukaryotic diversity were associated with high and more stable biomass productivity.
Second, I present ProteoStorm, our computational workflow for efficient and sensitive peptide identification in metaproteomics samples against massive microbial protein databases. Our approach focuses on efficiently reducing the set of candidate peptides for each spectrum, thus obtaining a 100- to 1000-fold speedup at the expense of a minimal loss in sensitivity. Our re-analysis of urinary tract infection datasets using a comprehensive database identified bacterial genera not previously known to be associated with such samples.
Last, I present our study on the landscape of extrachromosomal DNA (ecDNA) in human cancer, where we employed whole-genome sequencing, structural modelling and cytogenetic analyses of 17 different cancer types, including 2,572 dividing cells in metaphase. I focus on the exploration of the presence and diversity of ecDNA in tumor cells, which we conducted using ECdetect, an image analysis software I developed. We found that ecDNA was present in nearly half of human cancers and almost never in normal cells. Using ECdetect, we were also able to estimate the diversity of ecDNA counts in tumor cell lines.