McNair, Katelyn

Fixing What’s Wrong: The Problem of Coding Anomalies in Genomic Data

2023

Advisor(s): Segall, Anca

Creative Commons 'BY-SA' version 4.0 license

Abstract

Genomic sequencing is at the forefront of biological research, in part due to the ever-increasing accessibility of sequencing technology. What used to take an entire lab devoted solely to nucleotide sequencing can now be accomplished by a single person using a tiny USB device plugged into an old laptop. Around 80% of these data are microbial in nature, originating from Bacteria, Archaea, or viruses. Generally, the goal of this sequencing is to identify and analyze the genes that occur within the DNA (or RNA), because genes carry out the functions of every cell and are its building blocks. Frameshifts, where coding sequences switch between frames on the collinear strand, are ubiquitous in genomic data, arising both from artificially induced sequencing error and from naturally occurring cellular processes. These frameshifts break genes into different frames, which confounds downstream gene prediction analysis, since all current genome-based analyses of microbes use methods that neglect these frameshifts. To improve current genomic research, we developed software that identifies both artificial and natural frameshifts within an input genome, using a convolutional neural network to classify fixed-size windows taken from the genome in a scrolling manner. Windows that come from a coding gene are classified as coding, while windows taken from intergenic regions, or taken out of frame with a coding gene, are classified as non-coding. The individual window predictions are then analyzed by a change-point algorithm to detect where a coding gene begins or ends.
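
The window-then-change-point pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis software: `toy_classifier` is a hypothetical stand-in for the convolutional neural network (it only checks for in-frame stop codons, so unlike the real classifier it cannot recognize out-of-frame windows), and `change_points` is a simple threshold-crossing detector standing in for the change-point algorithm. The window size, step, and threshold are arbitrary.

```python
from typing import Callable, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def toy_classifier(window: str) -> float:
    """Hypothetical stand-in for the CNN: 1.0 if the window contains
    no stop codon in its own reading frame, otherwise 0.0."""
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    return 0.0 if any(c in STOP_CODONS for c in codons) else 1.0


def window_scores(genome: str, size: int, step: int,
                  classify: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Slide a fixed-size window along the genome and score each window."""
    return [(start, classify(genome[start:start + size]))
            for start in range(0, len(genome) - size + 1, step)]


def change_points(scores: List[Tuple[int, float]],
                  threshold: float = 0.5) -> List[Tuple[int, str]]:
    """Report the window start at which the thresholded label flips.
    This is a simplification, not the thesis's change-point algorithm."""
    changes = []
    previous = None
    for start, score in scores:
        label = score >= threshold
        if previous is not None and label != previous:
            changes.append((start, "coding" if label else "noncoding"))
        previous = label
    return changes


# Synthetic example: stop codons flanking an open stretch of GCT codons.
genome = "TAA" * 3 + "GCT" * 6 + "TAA" * 3
scores = window_scores(genome, size=9, step=3, classify=toy_classifier)
print(change_points(scores))
```

On this synthetic sequence the detector reports one transition into the open stretch and one out of it. A step of 3 keeps every window in a single reading frame; scanning all frames and both strands would require repeating the pass with different offsets and on the reverse complement.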


UC Irvine
