eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Flexible Bayesian Modeling of Multivariate Count Data

Creative Commons 'BY' version 4.0 license
Abstract

The analysis of multivariate count data presents significant statistical challenges: the data are discrete, contain excess zeros, are over-dispersed, and are high-dimensional, all of which are common in practical applications. These challenges are further complicated by the presence of covariates. Traditional methods frequently struggle with these complexities, which can lead to poor estimates of feature abundances and of the dependencies among features. This thesis develops flexible Bayesian statistical methodologies, particularly for cases where the distribution of a multivariate random vector exhibits non-Gaussianity, heterogeneity, and heteroscedasticity, using count table data from microbiome studies as motivating examples. First, we propose a Bayesian zero-inflated rounded log-normal kernel method that infers feature interdependencies through the covariance between features measured in counts. We employ a factor model that assumes a lower-dimensional structure for the covariance matrix, and impose joint sparsity on its factor loadings using a Dirichlet-Laplace (Dir-Laplace) prior. This sparse spiked covariance structure reduces the number of parameters and makes estimation more robust in high-dimensional settings. A regression model is used to characterize changes in mean feature abundance with covariates, and a Bayesian nonparametric approach is adopted to handle large variability across samples. For problems involving multiple count tables obtained from different groups, we extend the sparse factor model and develop a Bayesian group factor model that infers within-group and across-group feature interdependencies. We model the count vectors directly with a flexible infinite mixture of rounded log-normal kernels, specified through a Dirichlet process prior, and construct a Dirichlet-Horseshoe (Dir-HS) shrinkage prior for the factor loadings that induces joint sparsity more efficiently given the greater number of features in a multiple-group setting.
Lastly, we develop a covariate-dependent factor model for settings in which both the mean and the covariance structure of a multivariate count vector vary with covariates; the model flexibly estimates the covariate-induced heteroscedasticity in the covariance matrix. Our approach performs covariance regression by placing a linear regression on the lower-dimensional factor loading matrix. This formulation, combined with the joint sparsity imposed by the Dir-HS prior, provides robust estimation of covariate-dependent covariance in high-dimensional settings. For all developed models, we carefully explore their properties and perform extensive simulation studies to examine their performance. In addition, real data examples from microbiome studies are used for illustration.
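
To make the ingredients named in the abstract concrete, the following is a minimal simulation sketch, not the thesis's model or code. It assumes a latent Gaussian vector whose covariance has the low-rank-plus-diagonal ("spiked") form, with a factor loading matrix in which most entries are zero, maps the latent vector to counts by flooring its exponential (one simple choice of rounding for a log-normal kernel), and then sets entries to zero at random to mimic zero inflation. The function name `simulate_counts`, every parameter value, and the random masking used in place of a Dir-Laplace or Dir-HS prior are illustrative assumptions.

```python
import numpy as np


def simulate_counts(n=200, p=10, k=2, zero_prob=0.3, sigma2=0.5, seed=0):
    """Simulate zero-inflated counts from a rounded log-normal factor model.

    Hypothetical illustration only: all defaults are arbitrary, and the
    sparse loadings are drawn by random masking, not from a shrinkage prior.
    """
    rng = np.random.default_rng(seed)

    # Sparse p x k factor loadings: roughly 70% of entries are set to zero.
    loadings = rng.normal(size=(p, k)) * (rng.random((p, k)) < 0.3)

    # Spiked covariance: low-rank part plus isotropic noise.
    cov = loadings @ loadings.T + sigma2 * np.eye(p)

    # Latent Gaussian vectors on the log scale (constant mean, no covariates).
    mu = np.full(p, 1.0)
    latent = rng.multivariate_normal(mu, cov, size=n)

    # Rounding step: floor of a log-normal variate gives a non-negative count.
    counts = np.floor(np.exp(latent)).astype(int)

    # Zero inflation: each entry is independently forced to zero.
    counts[rng.random((n, p)) < zero_prob] = 0
    return counts, cov


counts, cov = simulate_counts()
print(counts.shape, round(float((counts == 0).mean()), 2))
```

In this sketch the covariance has `p * k + 1` free parameters instead of `p * (p + 1) / 2`, which is the parameter reduction that a factor structure provides; a covariate-dependent version would let `mu` and `loadings` vary with a covariate vector.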
