Harnessing AI/ML for Proteomics: Post-Translational Modification Prediction and Proteome Turnover Imputation
- Yan, Yu
- Advisor(s): Ping, Peipei
Abstract
The field of proteomics, encompassing the exhaustive study of proteins, their structures, functions, post-translational modifications, dynamics, and interactions, stands as a crucial domain in the quest to understand biological systems and disease mechanisms. The rise of high-throughput technologies, notably mass spectrometry, has exponentially increased the volume and complexity of proteomic data, posing both opportunities and challenges in large-scale data analysis and interpretation. In this context, the integration of Artificial Intelligence (AI) and Machine Learning (ML) methodologies presents a transformative strategy, promising to significantly enhance various facets of data analysis in proteomics. This dissertation is dedicated to exploring the application of AI/ML in the domain of proteomics.
The first theme of this dissertation introduces MIND-S, a deep-learning platform designed to predict protein post-translational modifications (PTMs). MIND-S uses protein sequence and structure, combining a transformer model with a graph neural network to efficiently predict multiple PTMs. It features an interpretation module that identifies which amino acids are relevant to a prediction and uncovers PTM patterns without direct supervision. Additionally, it assesses the effects of mutations on PTMs and has been validated using biological data. This work demonstrates MIND-S's accuracy and efficiency in analyzing PTM processes in both health and disease.
The second theme delves into gene representation through a comprehensive, task-agnostic approach, aiming for a holistic understanding of molecular events. Traditional gene embeddings often focus narrowly on specific tasks and miss the broader picture. This study evaluates nine gene embeddings across three categories: experimental, literature, and knowledge graph data. Using Singular Vector Canonical Correlation Analysis (SVCCA), the study shows that these representations contain unique, minimally overlapping information, supporting rich, multifaceted embeddings. The task-agnostic approach outperforms task-specific approaches in various benchmark tests and successfully imputes missing data, enhancing individual embeddings. It offers a robust framework for comprehensive biomolecule characterization, with significant benefits for biomedical AI applications.
The third theme addresses the challenge of missing values in temporal proteomics datasets, which can obscure critical measurements and impair the understanding of biomedical processes. A Data Multiple Imputation (DMI) pipeline was developed to enable robust analysis of protein turnover rates in time-series data. This approach was applied to murine cardiac and human plasma datasets, greatly improving the detection of protein turnover rates and uncovering new biological insights. The imputed data provided a more comprehensive depiction of proteins, enhancing the understanding of biological pathways and disease associations. Notably, DMI outperformed single imputation methods in benchmark evaluations, demonstrating its effectiveness in managing missing data in temporal proteomics.