Modern large-scale and noisy datasets call for statistical methods that are flexible enough to capture complicated underlying signal structure for better predictive performance, and at the same time interpretable enough that the outputs can be directly understood and used by practitioners who are not necessarily statisticians. This collection of research addresses these needs for flexible and interpretable statistical methods in two settings: interaction modeling, where the focus is on understanding how interactions between predictors help with making predictions, and multivariate time series modeling, where the goal is to model (potentially evolving) dependence structures among several time series.
In Part I of this dissertation, we propose a framework for interaction modeling based on interaction reluctance, a recently developed principle positing that main effects should be preferred over interactions when the two give similar prediction performance. We provide a highly non-trivial extension of an existing method to generalized linear models (GLMs). Our proposal makes no assumptions about the structure among the true interactions and scales to large datasets. Theoretically, we demonstrate that, with high probability, it retains all true interactions in high-dimensional settings.
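To make the reluctance principle concrete, the sketch below shows a generic three-stage "fit main effects, screen interactions against the residual, refit" pipeline for the Gaussian case. This is an illustration of the general idea only, not the dissertation's GLM method (which would, for instance, use working residuals and a different screening rule); all variable names and tuning values here are hypothetical.

```python
# Illustrative three-stage reluctant-interaction sketch (Gaussian response).
# NOT the dissertation's algorithm; names and alpha values are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
# Toy truth: two main effects plus one strong interaction (features 2 and 3).
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 2] * X[:, 3] + 0.1 * rng.standard_normal(n)

# Stage 1: fit main effects only (interactions are "reluctantly" excluded).
main = Lasso(alpha=0.05).fit(X, y)
resid = y - main.predict(X)

# Stage 2: screen pairwise interactions by |correlation| with the residual,
# so an interaction enters only if it explains what main effects could not.
pairs = [(j, k) for j in range(p) for k in range(j + 1, p)]
scores = [abs(np.corrcoef(X[:, j] * X[:, k], resid)[0, 1]) for j, k in pairs]
top = [pairs[i] for i in np.argsort(scores)[::-1][:3]]  # keep a few candidates

# Stage 3: refit a sparse model on main effects plus screened interactions.
Z = np.column_stack([X] + [X[:, j] * X[:, k] for j, k in top])
final = Lasso(alpha=0.05).fit(Z, y)
```

Because screening only compares a modest number of candidate products against a single residual vector, this style of pipeline avoids ever forming the full set of p(p-1)/2 interaction columns at once, which is what makes reluctant approaches scalable.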
In Part II, we introduce TwinterNet, a neural network architecture designed to detect and exploit interactions for predictive modeling in the presence of multiple data views. One premise of our proposal is to distinguish interactions within data views from those between data views, which makes the selected interactions more interpretable. Furthermore, the framework captures complex, nonlinear interactions among predictors and allows for arbitrary distributions of the response variable.
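The within-view versus between-view distinction can be pictured as separate pathways in a network. The toy forward pass below illustrates that idea with two views and a multiplicative cross-view branch; it is a minimal sketch under my own assumptions, not TwinterNet's actual architecture, and all dimensions and names are invented for illustration.

```python
# Toy two-view forward pass separating within-view and between-view
# interaction pathways. Illustrative only; not TwinterNet's architecture.
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2, h = 8, 4, 3, 5
Xa = rng.standard_normal((n, p1))  # view A predictors
Xb = rng.standard_normal((n, p2))  # view B predictors

relu = lambda z: np.maximum(z, 0.0)

# Within-view branches: each view has its own hidden layer, so any
# interactions learned here involve predictors from a single view.
Wa, Wb = rng.standard_normal((p1, h)), rng.standard_normal((p2, h))
Ha, Hb = relu(Xa @ Wa), relu(Xb @ Wb)

# Between-view branch: pairwise cross-view products feed a separate layer,
# making cross-view interactions identifiable as their own pathway.
cross = np.einsum("ni,nj->nij", Xa, Xb).reshape(n, p1 * p2)
Wc = rng.standard_normal((p1 * p2, h))
Hc = relu(cross @ Wc)

# The output combines the three pathways; in a GLM-style setup a link
# function here would accommodate non-Gaussian responses.
w = rng.standard_normal(3 * h)
yhat = np.concatenate([Ha, Hb, Hc], axis=1) @ w
```

Keeping the pathways architecturally separate is what lets one attribute a detected interaction to "within view A", "within view B", or "between views" simply by inspecting which branch carries it.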
In Part III, we shift our focus to threshold multivariate time series modeling, where the dependence structures (e.g., a vector autoregressive model) among multiple time series may change as a function of some switching variable. In practice, the pivotal switching variable and the corresponding autoregressive processes (i.e., regimes) are largely unknown. We develop a framework that uses principal component analysis (PCA) to estimate the switching variable from multiple factors, and sparse regression techniques to determine regime changes. Numerical studies on both synthetic datasets and an economic data application show the favorable performance of our proposal.