We introduce and investigate matrix approximation by decomposition into a sum
of radial basis function (RBF) components. An RBF component is a generalization
of the outer product of a pair of vectors, in which a radial basis function replaces
the scalar multiplication between individual vector elements. Although the RBFs
themselves are positive definite, the summation across components is not restricted
to convex combinations, which allows us to compute the decomposition of any real
matrix, not necessarily symmetric or positive definite.
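For concreteness, with a Gaussian kernel (one possible choice of RBF; the notation here is illustrative), a decomposition of $A \in \mathbb{R}^{m \times n}$ into $r$ components can be written as
\[
  A_{ij} \;\approx\; \sum_{k=1}^{r} w_k \exp\!\Bigl(-\gamma_k \bigl(u^{(k)}_i - v^{(k)}_j\bigr)^{2}\Bigr),
  \qquad u^{(k)} \in \mathbb{R}^{m},\; v^{(k)} \in \mathbb{R}^{n},\; w_k \in \mathbb{R},
\]
in analogy with the truncated-SVD sum of outer products $A_{ij} \approx \sum_{k} \sigma_k u^{(k)}_i v^{(k)}_j$, but with the weights $w_k$ unconstrained.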
We formulate the problem of finding such a decomposition as an optimization
problem with a nonlinear, non-convex loss function. Several modern variants of
gradient descent, including their scalable stochastic counterparts, are used to
solve this problem.
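As a minimal sketch of such a fitting loop (the Gaussian kernel, the Adam optimizer, PyTorch, and the parameter names below are illustrative assumptions rather than a full specification of our procedure), one could write:

```python
# Illustrative sketch: fit A_ij ~ sum_k w_k * exp(-gamma_k * (u_ki - v_kj)^2)
# by gradient descent on the squared (L2) reconstruction error.
import torch

def fit_rbf_decomposition(A, rank, steps=2000, lr=1e-2):
    m, n = A.shape
    # Learnable parameters; the initialization is an arbitrary choice here.
    U = torch.randn(rank, m, requires_grad=True)        # one u^(k) per component
    V = torch.randn(rank, n, requires_grad=True)        # one v^(k) per component
    w = torch.randn(rank, requires_grad=True)           # unconstrained mixing weights
    log_gamma = torch.zeros(rank, requires_grad=True)   # kernel widths, kept positive via exp

    opt = torch.optim.Adam([U, V, w, log_gamma], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Pairwise squared differences between vector elements, shape (rank, m, n).
        diff2 = (U[:, :, None] - V[:, None, :]) ** 2
        A_hat = (w[:, None, None]
                 * torch.exp(-torch.exp(log_gamma)[:, None, None] * diff2)).sum(0)
        loss = ((A - A_hat) ** 2).sum()
        loss.backward()
        opt.step()
    return U.detach(), V.detach(), w.detach(), log_gamma.exp().detach()

# Example: approximate a random 100x80 matrix with 5 RBF components.
A = torch.randn(100, 80)
U, V, w, gamma = fit_rbf_decomposition(A, rank=5)
```

A stochastic counterpart of this loop would compute the loss on a sampled subset of matrix entries at each step rather than on the full matrix.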
We provide extensive empirical evidence of the effectiveness of the RBF
decomposition and of the gradient-based fitting algorithm. While conceptually
motivated by
singular value decomposition (SVD), our proposed nonlinear counterpart
outperforms SVD by drastically reducing the memory required to approximate a
data matrix to the same L2 error across a wide range of matrix types. For
example, it yields a 2- to 6-fold memory saving for Gaussian noise, graph
adjacency matrices, and kernel matrices. Moreover, this proximity-based
decomposition can offer additional interpretability in applications that
involve, e.g., capturing the intrinsic low-dimensional structure of the data,
retaining graph connectivity structure, and preserving the acutance of images.