Thesis
Peer Reviewed

Efficient Computing in Hyperdimensional Space

Kang, Jaeyoung
Advisor(s): Rosing, Tajana

UC San Diego Electronic Theses and Dissertations (2023)

The rapid proliferation of data has expedited the development of cutting-edge machine learning (ML) methods, including deep neural networks, to find underlying insights from data. However, the large amount of data movement has driven up energy requirements, posing a new threat of surpassing global energy production. In this dissertation, we explore a novel computing paradigm, hyperdimensional computing (HDC), which mimics attributes of the human brain's neuronal circuits with lightweight arithmetic on low-precision high-dimensional vectors and significantly enhances speed and energy efficiency. We present OpenHD, a GPU-based software infrastructure for HDC, which uses Just-in-Time compilation to automatically generate optimized GPU code for HDC applications, easing the development and optimization effort needed for deployment. The proposed framework achieves 4.5x and 146x speedup on average for GPU-based HDC classification and clustering, respectively.

Using OpenHD, we first expand the variety of data structures that can be supported in HDC applications. We design an HDC-based ML solution that supports graph data, and implement it on GPU. Our solution, RelHD, enables graph-based ML by aggregating the relationship between data and features of data into a single high-dimensional vector. We further accelerate RelHD using processing in-memory (PIM) based on FeFET. The PIM accelerator offers 10x speedup and 986x energy efficiency improvement over the state-of-the-art crossbar memory-based graph neural network accelerator.

While most existing HDC-based applications have used small-scale datasets, we show that HDC can scale to tackle large-scale problems. We present an HDC-based approach called HyperOMS for large-scale open modification spectral library searching (OMS) in mass spectrometry-based proteomics analysis. We develop a novel HDC-based OMS algorithm and accelerate it on GPU using the OpenHD framework. To run HyperOMS efficiently, we devise a DRAM-based PIM accelerator with optimization strategies to maximize parallelism. Evaluation results show that the accelerator yields up to 100x speedup and has 1,337x higher energy efficiency over the state-of-the-art OMS tool running on GPU while offering comparable search quality to competing solutions.
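
The "lightweight arithmetic on low-precision high-dimensional vectors" mentioned in the abstract above can be illustrated with the three generic HDC primitives: binding, bundling, and similarity. The following is a minimal pure-Python sketch under stated assumptions; the dimensionality, the bipolar representation, and all function names are illustrative choices and are not taken from OpenHD or any other tool listed here.

```python
import random

DIM = 10_000  # hypervector dimensionality (illustrative assumption)

def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Element-wise multiplication: associates two hypervectors."""
    return [x * y for x, y in zip(a, b)]

def bundle(hvs):
    """Element-wise majority vote: superimposes several hypervectors."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / DIM

rng = random.Random(0)
a, b, c = (random_hv(rng) for _ in range(3))
proto = bundle([a, b, c])  # a "class prototype" built from three samples

print(similarity(a, b))      # close to 0: random hypervectors are quasi-orthogonal
print(similarity(proto, a))  # clearly positive: the bundle stays similar to its members
```

Binding a bipolar hypervector twice with the same partner recovers the original, which is why binding can be used to attach and later detach a key from a value.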

Article
Peer Reviewed

HyperSpec: Ultrafast Mass Spectra Clustering in Hyperdimensional Space

UC San Diego Previously Published Works (2023)

As current shotgun proteomics experiments can produce gigabytes of mass spectrometry data per hour, processing these massive data volumes has become progressively more challenging. Spectral clustering is an effective approach to speed up downstream data processing by merging highly similar spectra to minimize data redundancy. However, because state-of-the-art spectral clustering tools fail to achieve optimal runtimes, this simply moves the processing bottleneck. In this work, we present a fast spectral clustering tool, HyperSpec, based on hyperdimensional computing (HDC). HDC shows promising clustering capability while only requiring lightweight binary operations with high parallelism that can be optimized using low-level hardware architectures, making it possible to run HyperSpec on graphics processing units to achieve extremely efficient spectral clustering performance. Additionally, HyperSpec includes optimized data preprocessing modules to reduce the spectrum preprocessing time, which is a critical bottleneck during spectral clustering. Based on experiments using various mass spectrometry data sets, HyperSpec produces results with clustering quality comparable to that of state-of-the-art spectral clustering tools while achieving speedups by orders of magnitude, shortening the clustering runtime of over 21 million spectra from 4 h to only 24 min.
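
To make the idea of merging highly similar spectra with "lightweight binary operations" concrete, the sketch below clusters bit-packed binary hypervectors by Hamming distance. It is only an illustration: the greedy threshold rule is a deliberately simple stand-in and is not claimed to be the clustering algorithm HyperSpec uses, and the dimensionality, threshold, and helper names are assumptions.

```python
import random

DIM = 2048  # bits per hypervector (illustrative assumption)

def hamming(a, b):
    """Number of differing bits between two bit-packed hypervectors."""
    return bin(a ^ b).count("1")  # XOR followed by a population count

def greedy_cluster(hvs, max_dist):
    """Assign each hypervector to the first cluster whose representative
    is within max_dist bits; otherwise start a new cluster."""
    reps, labels = [], []
    for hv in hvs:
        for idx, rep in enumerate(reps):
            if hamming(hv, rep) <= max_dist:
                labels.append(idx)
                break
        else:
            reps.append(hv)
            labels.append(len(reps) - 1)
    return labels

def noisy_copy(hv, flips, rng):
    """Flip `flips` distinct random bits to mimic a near-duplicate spectrum."""
    for pos in rng.sample(range(DIM), flips):
        hv ^= 1 << pos
    return hv

rng = random.Random(1)
base_a, base_b = rng.getrandbits(DIM), rng.getrandbits(DIM)
hvs = [base_a, noisy_copy(base_a, 40, rng), base_b, noisy_copy(base_b, 40, rng)]
labels = greedy_cluster(hvs, max_dist=DIM // 8)
print(labels)  # near-duplicates share a label; unrelated spectra do not
```

The distance computation reduces to an XOR and a bit count per pair, which is the kind of operation that maps well onto GPUs and other parallel hardware.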

Creative Commons 'BY' version 4.0 license

Article
Peer Reviewed

Accelerating open modification spectral library searching on tensor core in high-dimensional space

UC San Diego Previously Published Works (2023)

Motivation

Driven by technological advances, the throughput and cost of mass spectrometry (MS) proteomics experiments have improved by orders of magnitude in recent decades. Spectral library searching is a common approach to annotating experimental mass spectra by matching them against large libraries of reference spectra corresponding to known peptides. An important disadvantage, however, is that only peptides included in the spectral library can be found, whereas novel peptides, such as those with unexpected post-translational modifications (PTMs), will remain unknown. Open modification searching (OMS) is an increasingly popular approach to annotate modified peptides based on partial matches against their unmodified counterparts. Unfortunately, this leads to very large search spaces and excessive runtimes, which is especially problematic considering the continuously increasing sizes of MS proteomics datasets.

Results

We propose an OMS algorithm, called HOMS-TC, that fully exploits parallelism in the entire pipeline of spectral library searching. We designed a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss. This process can be easily parallelized since each dimension is calculated independently. HOMS-TC processes two stages of existing cascade search in parallel and selects the most similar spectra while considering PTMs. We accelerate HOMS-TC on NVIDIA's tensor core units, which are emerging and readily available in recent graphics processing units (GPUs). Our evaluation shows that HOMS-TC is 31× faster on average than alternative search engines and provides comparable accuracy to competing search tools.

Availability and implementation

HOMS-TC is freely available under the Apache 2.0 license as an open-source software project at https://github.com/tycheyoung/homs-tc.
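
The statement in the Results above that encoding parallelizes because "each dimension is calculated independently" can be illustrated with a generic ID-level encoding, a common HDC scheme. The sketch below is an assumption-laden illustration and not the HOMS-TC encoder: the bin count, the number of intensity levels, and every name in it are hypothetical.

```python
import random

DIM = 2048        # hypervector dimensionality (illustrative)
NUM_BINS = 100    # number of m/z bins (hypothetical)
NUM_LEVELS = 8    # number of quantized intensity levels (hypothetical)

rng = random.Random(42)

# One random binary ID hypervector per m/z bin.
ID_HVS = [[rng.getrandbits(1) for _ in range(DIM)] for _ in range(NUM_BINS)]

def make_level_hvs():
    """Level hypervectors: neighbouring intensity levels differ in few bits."""
    hv = [rng.getrandbits(1) for _ in range(DIM)]
    order = rng.sample(range(DIM), DIM)           # random flip order
    step = DIM // (2 * (NUM_LEVELS - 1))
    levels = [hv[:]]
    for lvl in range(1, NUM_LEVELS):
        for pos in order[(lvl - 1) * step : lvl * step]:
            hv[pos] ^= 1
        levels.append(hv[:])
    return levels

LEVEL_HVS = make_level_hvs()

def encode(peaks):
    """Encode a spectrum given as [(mz_bin, intensity_level), ...].

    Each peak's ID is bound (XOR) to its level, then all peaks are bundled
    by a per-dimension majority vote. Output dimension d depends only on
    dimension d of the inputs, so the outer loop could run fully in parallel.
    """
    out = []
    for d in range(DIM):
        ones = sum(ID_HVS[b][d] ^ LEVEL_HVS[l][d] for b, l in peaks)
        out.append(1 if 2 * ones > len(peaks) else 0)
    return out

def hamming_similarity(a, b):
    """Fraction of matching bits."""
    return sum(x == y for x, y in zip(a, b)) / DIM

query = [(3, 7), (10, 2), (42, 5), (57, 1), (80, 6)]
shifted = [(3, 7), (10, 2), (42, 5), (58, 1), (81, 6)]    # shares 3 of 5 peaks
unrelated = [(1, 0), (20, 4), (33, 3), (64, 7), (99, 2)]  # shares no peaks

sim_shifted = hamming_similarity(encode(query), encode(shifted))
sim_unrelated = hamming_similarity(encode(query), encode(unrelated))
print(sim_shifted, sim_unrelated)
```

In this toy setting, a spectrum that shares only some peaks with the query still scores well above an unrelated one, which mirrors the partial-match idea behind open modification searching.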

Article
Peer Reviewed

SpecHD: Hyperdimensional Computing Framework for FPGA-Based Mass Spectrometry Clustering

UC San Diego Previously Published Works (2024)

Article
Peer Reviewed

FPGA Acceleration of Protein Back-Translation and Alignment

UC San Diego Previously Published Works (2021)

Identifying genome functionality changes our understanding of humans and helps us in disease diagnosis, as well as in drug, bio-material, and genetic engineering of plants and animals. Comparing the structure of the protein sequences, when only sequence information is available, against a database with known functionality helps us to identify and recognize the functionality of the unknown sequence. The process of predicting the possible RNA sequence that a specific protein has originated from is called back-translation. Aligning the back-translated RNA sequence against the database locates the most similar sequences, which are used to predict the functionality of the unknown protein sequence. Providing massive parallelism, FPGAs can accelerate bioinformatics applications substantially. In this paper, we propose FabP (FabP is also the name of a family of proteins, 'Fatty-Acid-Binding Proteins'), an optimized FPGA-based accelerator for aligning a back-translated protein sequence against a database of DNA/RNA sequences. FabP is deeply optimized to fully utilize the FPGA resources and the DRAM memory bandwidth to maximize the performance. FabP on a mid-range FPGA provides 8.1% and 23.3× (24.8× and 266.8×) speedup and higher energy efficiency as compared to the GPU-based implementation on a high-end NVIDIA GPU (state-of-the-art CPU implementation), respectively.
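
Back-translation and alignment as described above can be sketched in a few lines. The example builds the standard genetic code table, maps each amino acid to the set of RNA codons that may encode it, and slides that pattern along an RNA sequence. It is a simplified, ungapped, CPU-only illustration with hypothetical function names, not FabP's algorithm or hardware design.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Back-translation table: amino acid -> set of RNA codons that encode it.
CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, set()).add(b1 + b2 + b3)

def back_translate(protein):
    """Return, per residue, the set of codons it may have originated from."""
    return [CODONS[aa] for aa in protein]

def best_alignment(protein, rna):
    """Slide the back-translated protein along `rna` (ungapped) and return
    (offset, matched_codons) for the best-scoring position."""
    pattern = back_translate(protein)
    best = (0, -1)
    for off in range(len(rna) - 3 * len(pattern) + 1):
        score = sum(rna[off + 3 * i : off + 3 * i + 3] in codons
                    for i, codons in enumerate(pattern))
        if score > best[1]:
            best = (off, score)
    return best

print(sorted(CODONS["K"]))                 # ['AAA', 'AAG']
print(best_alignment("MK", "GGAUGAAGCC"))  # (2, 2)
```

Every offset is scored independently of the others, which is the property that makes this kind of search a natural fit for massively parallel hardware.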

Article
Peer Reviewed

HDBind: encoding of molecular structure with hyperdimensional binary representations.

UC San Diego Previously Published Works (2024)

Traditional methods for identifying hit molecules from a large collection of potential drug-like candidates rely on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. These approaches have a significant limitation in that they require exceptional computing capabilities for even relatively small collections of molecules. Increasingly large and complex state-of-the-art deep learning approaches have gained popularity with the promise to improve the productivity of drug design, notorious for its numerous failures. However, as deep learning models increase in their size and complexity, their acceleration at the hardware level becomes more challenging. Hyperdimensional Computing (HDC) has recently gained attention in the computer hardware community due to its algorithmic simplicity relative to deep learning approaches. The HDC learning paradigm, which represents data with high-dimension binary vectors, allows the use of low-precision binary vector arithmetic to create models of the data that can be learned without the need for the gradient-based optimization required in many conventional machine learning and deep learning methods. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated in a range of application areas (computer vision, bioinformatics, mass spectrometry, remote sensing, edge devices, etc.). To the best of our knowledge, our work is the first to consider HDC for the task of fast and efficient screening of modern drug-like compound libraries. We also propose the first HDC graph-based encoding methods for molecular data, demonstrating consistent and substantial improvement over previous work. We compare our approaches to alternative approaches on the well-studied MoleculeNet dataset and the recently proposed LIT-PCBA dataset derived from high quality PubChem assays. We demonstrate our methods on multiple target hardware platforms, including Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), showing at least an order of magnitude improvement in energy efficiency versus even our smallest neural network baseline model with a single hidden layer. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools. We make our code publicly available at https://github.com/LLNL/hdbind.

Article
Peer Reviewed

Swapping Metagenomics Preprocessing Pipeline Components Offers Speed and Sensitivity Increases

UC San Diego Previously Published Works (2022)

Increasing data volumes on high-throughput sequencing instruments such as the NovaSeq 6000 lead to long computational bottlenecks for common metagenomics data preprocessing tasks such as adaptor and primer trimming and host removal. Here, we test whether faster recently developed computational tools (Fastp and Minimap2) can replace widely used choices (Atropos and Bowtie2), obtaining dramatic accelerations with additional sensitivity and minimal loss of specificity for these tasks. Furthermore, the taxonomic tables resulting from downstream processing provide biologically comparable results. However, we demonstrate that for taxonomic assignment, Bowtie2's specificity is still required. We suggest that periodic reevaluation of pipeline components, together with improvements to standardized APIs to chain them together, will greatly enhance the efficiency of common bioinformatics tasks while also facilitating incorporation of further optimized steps running on GPUs, FPGAs, or other architectures. We also note that a detailed exploration of available algorithms and pipeline components is an important step that should be taken before optimization of less efficient algorithms on advanced or nonstandard hardware.

IMPORTANCE: In shotgun metagenomics studies that seek to relate changes in microbial DNA across samples, processing the data on a computer often takes longer than obtaining the data from the sequencing instrument. Recently developed software packages that perform individual steps in the pipeline of data processing in principle offer speed advantages, but in practice they may contain pitfalls that prevent their use; for example, they may make approximations that introduce unacceptable errors in the data. Here, we show that differences in choices of these components can speed up overall data processing by 5-fold or more on the same hardware while maintaining a high degree of correctness, greatly reducing the time taken to interpret results. This is an important step for using the data in clinical settings, where the time taken to obtain the results may be critical for guiding treatment.

Article

Wastewater and surface monitoring to detect COVID-19 in elementary school settings: The Safer at School Early Alert project

UC San Diego Previously Published Works (2023)

BACKGROUND: Schools are high-risk settings for SARS-CoV-2 transmission, but necessary for children's educational and social-emotional wellbeing. Previous research suggests that wastewater monitoring can detect SARS-CoV-2 infections in controlled residential settings with high levels of accuracy. However, its effective accuracy, cost, and feasibility in non-residential community settings are unknown. METHODS: The objective of this study was to determine the effectiveness and accuracy of community-based passive wastewater and surface (environmental) surveillance to detect SARS-CoV-2 infection in neighborhood schools compared to weekly diagnostic (PCR) testing. We implemented an environmental surveillance system in nine elementary schools with 1700 regularly present staff and students in southern California. The system was validated from November 2020 to March 2021. FINDINGS: In 447 data collection days across the nine sites, 89 individuals tested positive for COVID-19, and SARS-CoV-2 was detected in 374 surface samples and 133 wastewater samples. Ninety-three percent of identified cases were associated with an environmental sample (95% CI: 88%-98%); 67% were associated with a positive wastewater sample (95% CI: 57%-77%), and 40% were associated with a positive surface sample (95% CI: 29%-52%). The techniques we utilized allowed for near-complete genomic sequencing of wastewater and surface samples. INTERPRETATION: Passive environmental surveillance can detect the presence of COVID-19 cases in non-residential community school settings with a high degree of accuracy. FUNDING: County of San Diego, Health and Human Services Agency, National Institutes of Health, National Science Foundation, Centers for Disease Control.

Article
Peer Reviewed

Safer at school early alert: an observational study of wastewater and surface monitoring to detect COVID-19 in elementary schools

UC San Diego Previously Published Works (2023)

Background

Schools are high-risk settings for SARS-CoV-2 transmission, but necessary for children's educational and social-emotional wellbeing. Previous research suggests that wastewater monitoring can detect SARS-CoV-2 infections in controlled residential settings with high levels of accuracy. However, its effective accuracy, cost, and feasibility in non-residential community settings are unknown.

Methods

The objective of this study was to determine the effectiveness and accuracy of community-based passive wastewater and surface (environmental) surveillance to detect SARS-CoV-2 infection in neighborhood schools compared to weekly diagnostic (PCR) testing. We implemented an environmental surveillance system in nine elementary schools with 1700 regularly present staff and students in southern California. The system was validated from November 2020 to March 2021.

Findings

In 447 data collection days across the nine sites, 89 individuals tested positive for COVID-19, and SARS-CoV-2 was detected in 374 surface samples and 133 wastewater samples. Ninety-three percent of identified cases were associated with an environmental sample (95% CI: 88%-98%); 67% were associated with a positive wastewater sample (95% CI: 57%-77%), and 40% were associated with a positive surface sample (95% CI: 29%-52%). The techniques we utilized allowed for near-complete genomic sequencing of wastewater and surface samples.

Interpretation

Passive environmental surveillance can detect the presence of COVID-19 cases in non-residential community school settings with a high degree of accuracy.

Funding

County of San Diego, Health and Human Services Agency, National Institutes of Health, National Science Foundation, Centers for Disease Control.

Creative Commons 'BY-NC-ND' version 4.0 license