Article
Peer Reviewed

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

UC Berkeley Previously Published Works (2015)

We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences; it can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that accurate information about protein structure can be determined by providing only sequence data for various proteins to this model. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website (http://llp.berkeley.edu) and Harvard Dataverse (http://dx.doi.org/10.7910/DVN/JMFHTN).
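As a rough illustration of what "a single dense n-dimensional vector" per protein could look like in practice, the following minimal sketch splits a sequence into k-mers and sums one vector per k-mer. The abstract does not state the tokenization or the aggregation rule, so the 3-mer split with shifted reading offsets, the summation, and the function names (`shifted_kmers`, `toy_embedding`, `sequence_vector`) are assumptions made for illustration. The embedding here is a deterministic placeholder, not a trained model.

```python
import random


def shifted_kmers(seq, k=3):
    """Return k lists of non-overlapping k-mers, one list per reading offset."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def toy_embedding(kmer, dim=8):
    """Placeholder vector for a k-mer (hypothetical; stands in for a trained embedding)."""
    rng = random.Random(kmer)  # seeding with the k-mer string keeps the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sequence_vector(seq, k=3, dim=8):
    """Sum the k-mer vectors over all reading offsets into one dense vector."""
    total = [0.0] * dim
    for kmers in shifted_kmers(seq, k):
        for kmer in kmers:
            vec = toy_embedding(kmer, dim)
            total = [t + v for t, v in zip(total, vec)]
    return total


if __name__ == "__main__":
    seq = "MKTAYIAK"  # made-up example sequence
    print(shifted_kmers(seq))
    print(sequence_vector(seq))
```

A fixed-length vector of this kind could then be passed to any standard classifier, such as the support vector machines mentioned in the abstract.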

Creative Commons 'BY' version 4.0 license

Article
Peer Reviewed

Deciphering the Role of Emx1 in Neurogenesis: A Neuroproteomics Approach

UC San Francisco Previously Published Works (2016)

Emx1 has long been implicated in embryonic brain development. Previously we found that mice null for the Emx1 gene had smaller dentate gyri and reduced neurogenesis, although the molecular mechanisms underlying this defect were not well understood. To decipher the role of the Emx1 gene in neural regeneration and the timing of its involvement, we determined the frequency of neural stem cells (NSCs) in embryonic and adult forebrains of Emx1 wild type (WT) and knockout (KO) mice in the neurosphere assay. Emx1 gene deletion reduced the frequency and self-renewal capacity of NSCs of the embryonic brain but did not affect neuronal or glial differentiation. Emx1 KO NSCs also exhibited a reduced migratory capacity in response to serum or vascular endothelial growth factor (VEGF) in the Boyden chamber migration assay compared to their WT counterparts. A thorough comparison between NSC lysates from Emx1 WT and KO mice utilizing 2D-PAGE coupled with tandem mass spectrometry revealed 38 proteins differentially expressed between genotypes, including the F-actin depolymerization factor Cofilin. A global systems biology and cluster analysis identified several potential mechanisms and cellular pathways implicated in altered neurogenesis, all involving Cofilin1. Protein interaction network maps with functional enrichment analysis further indicated that the differentially expressed proteins participated in neural-specific functions including brain development, axonal guidance, synaptic transmission, neurogenesis, and hippocampal morphology, with VEGF as the upstream regulator intertwined with Cofilin1 and Emx1. Functional validation analysis indicated that, in addition to the overall reduced level of phosphorylated Cofilin1 (p-Cofilin1) in the Emx1 KO NSCs compared to WT NSCs demonstrated by western blot analysis, VEGF was able to induce more Cofilin1 phosphorylation and FLK expression only in WT NSCs. Our results suggest that a defect in Cofilin1 phosphorylation induced by VEGF or other growth factors might contribute to the reduced neurogenesis in the Emx1 null mice during brain development.

Article
Peer Reviewed

Uncovering precision phenotype-biomarker associations in traumatic brain injury using topological data analysis

UC San Francisco Previously Published Works (2017)

Background

Traumatic brain injury (TBI) is a complex disorder that is traditionally stratified based on clinical signs and symptoms. Recent imaging and molecular biomarker innovations provide unprecedented opportunities for improved TBI precision medicine, incorporating patho-anatomical and molecular mechanisms. Complete integration of these diverse data for TBI diagnosis and patient stratification remains an unmet challenge.

Methods and findings

The Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) Pilot multicenter study enrolled 586 acute TBI patients and collected diverse common data elements (TBI-CDEs) across the study population, including imaging, genetics, and clinical outcomes. We then applied topology-based data-driven discovery to identify natural subgroups of patients, based on the TBI-CDEs collected. Our hypothesis was two-fold: 1) A machine learning tool known as topological data analysis (TDA) would reveal data-driven patterns in patient outcomes to identify candidate biomarkers of recovery, and 2) TDA-identified biomarkers would significantly predict patient outcome recovery after TBI using more traditional methods of univariate statistical tests. TDA algorithms organized and mapped the data of TBI patients in multidimensional space, identifying a subset of mild TBI patients with a specific multivariate phenotype associated with unfavorable outcome at 3 and 6 months after injury. Further analyses revealed that this patient subset had high rates of post-traumatic stress disorder (PTSD), and enrichment in several distinct genetic polymorphisms associated with cellular responses to stress and DNA damage (PARP1), and in striatal dopamine processing (ANKK1, COMT, DRD2).

Conclusions

TDA identified a unique diagnostic subgroup of patients with unfavorable outcome after mild TBI that were significantly predicted by the presence of specific genetic polymorphisms. Machine learning methods such as TDA may provide a robust method for patient stratification and treatment planning targeting identified biomarkers in future clinical trials in TBI patients.

Trial registration

ClinicalTrials.gov Identifier NCT01565551.

Article
Peer Reviewed

Neuroinflammatory Biomarkers for Traumatic Brain Injury Diagnosis and Prognosis: A TRACK-TBI Pilot Study

UC San Francisco Previously Published Works (2023)

The relationship between systemic inflammation and secondary injury in traumatic brain injury (TBI) is complex. We investigated associations between inflammatory markers and clinical confirmation of TBI diagnosis and prognosis. The prospective TRACK-TBI Pilot (Transforming Research and Clinical Knowledge in Traumatic Brain Injury Pilot) study enrolled TBI patients who were triaged to head computed tomography (CT) and received a blood draw within 24 h of injury. Healthy controls (HCs) and orthopedic controls (OCs) were included. Thirty-one inflammatory markers were analyzed from plasma. Area under the receiver operating characteristic curve (AUC) was used to evaluate discriminatory ability. AUC >0.7 was considered acceptable. Criteria included: TBI diagnosis (vs. OC/HC); moderate/severe vs. mild TBI (Glasgow Coma Scale; GCS); radiographic TBI (CT positive vs. CT negative); 3- and 6-month Glasgow Outcome Scale-Extended (GOSE) dichotomized to death/greater relative disability versus less relative disability (GOSE 1-4/5-8); and incomplete versus full recovery (GOSE <8/=8). One hundred sixty TBI subjects, 28 OCs, and 18 HCs were included. Markers discriminating TBI/OC: HMGB-1 (AUC = 0.835), IL-1b (0.795), IL-16 (0.784), IL-7 (0.742), and TARC (0.731). Markers discriminating GCS 3-12/13-15: IL-6 (AUC = 0.747), CRP (0.726), IL-15 (0.720), and SAA (0.716). Markers discriminating CT positive/CT negative: SAA (AUC = 0.767), IL-6 (0.757), CRP (0.733), and IL-15 (0.724). At 3 months, IL-15 (AUC = 0.738) and IL-2 (0.705) discriminated GOSE 5-8/1-4. At 6 months, IL-15 discriminated GOSE 1-4/5-8 (AUC = 0.704) and GOSE <8/=8 (0.711); SAA discriminated GOSE 1-4/5-8 (0.704). We identified a profile of acute circulating inflammatory proteins with potential relevance for TBI diagnosis, severity differentiation, and prognosis. IL-15 and serum amyloid A (SAA) are priority markers with acceptable discrimination across multiple diagnostic and outcome categories. Validation in larger prospective cohorts is needed. ClinicalTrials.gov Registration: NCT01565551.
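For readers unfamiliar with the AUC criterion used above, here is a minimal sketch of how the statistic can be computed for a single marker. It uses the pairwise (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen case has a higher marker level than a randomly chosen control, with ties counted as one half. The marker values below are made up for illustration and are not study data. The function name `auc` and the `ACCEPTABLE` constant are hypothetical names, with the 0.7 threshold taken from the abstract.

```python
def auc(case_values, control_values):
    """AUC as the fraction of case/control pairs where the case value is higher.

    Ties contribute 0.5, matching the Mann-Whitney U formulation.
    """
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


ACCEPTABLE = 0.7  # threshold for acceptable discrimination stated in the abstract

if __name__ == "__main__":
    # Hypothetical marker concentrations (arbitrary units), not real measurements.
    cases = [5.1, 3.2, 4.8, 2.9]
    controls = [2.0, 3.0, 3.5]
    value = auc(cases, controls)
    print(f"AUC = {value:.3f}, acceptable: {value > ACCEPTABLE}")
```

With these illustrative values, 9 of the 12 case/control pairs are ordered correctly, giving an AUC of 0.75. The pairwise form is quadratic in sample size, which is fine for cohorts of this scale; a rank-based implementation would be preferable for much larger datasets.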
