With the advancement of modern technologies, large-scale and high-dimensional data are now collected in almost every scientific discipline. This introduces several challenges: the data are often contaminated by outliers arising from measurement error, and many variables follow heavy-tailed distributions.
To address these challenges, my thesis develops methodologies for mean estimation and matrix recovery when the data follow asymmetric and heavy-tailed distributions.
Additionally, I explore the characterization of tail behavior in random outcomes, focusing on expected shortfall (ES), which is widely recognized as a measure of risk.
I propose nonparametric approaches for estimating expected shortfall, aiming to enhance its accuracy and applicability.
In Chapter 1, we propose a robust estimator for recovering approximately low-rank matrices in the presence of heavy-tailed and asymmetric noise. Focusing on three archetypal applications, namely matrix compressed sensing, matrix completion, and multitask learning, we provide sub-Gaussian-type deviation bounds when the noise variables have only bounded variances.
Computationally, we propose a matrix version of the local adaptive majorize-minimization algorithm, which is much faster than the alternating direction method of multipliers used in previous work and scales to large datasets.
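As a rough illustration of the robustification idea only (not the penalized estimator or the majorize-minimization algorithm analyzed in the chapter), the following numpy sketch performs matrix completion with a Huber-type loss by plain gradient descent on a low-rank factorization; the rank, robustification parameter `tau`, step size, and iteration count are illustrative choices.

```python
import numpy as np

def huber_grad(r, tau):
    """Derivative of the Huber loss: r when |r| <= tau, else tau * sign(r)."""
    return np.clip(r, -tau, tau)

def robust_matrix_completion(Y, mask, rank=2, tau=1.0, lr=0.05, iters=2000, seed=0):
    """Recover an approximately low-rank matrix from partial, noisy entries.

    Y    : observed matrix (values outside mask are ignored)
    mask : boolean array marking observed entries
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    # small random factors; gradient descent is run on U and V jointly
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        R = (U @ V.T - Y) * mask   # residuals on observed entries only
        G = huber_grad(R, tau)     # clipping bounds the influence of outliers
        gU, gV = G @ V, G.T @ U    # gradients of the Huberized loss
        U -= lr * gU
        V -= lr * gV
    return U @ V.T
```

The only change relative to ordinary least-squares matrix completion is the clipping of residuals, which is what yields robustness to heavy-tailed noise.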
Chapter 2 studies robust and differentially private mean estimation and inference. We first provide a comprehensive analysis of the Huber mean estimator with increasing dimension, including non-asymptotic deviation bounds, a Bahadur representation, and (uniform) Gaussian approximations.
Then, we privatize the Huber mean estimator via noisy gradient descent and construct private confidence intervals for it by incorporating a private and robust covariance estimator.
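The privatization step can be sketched schematically. The simplified numpy sketch below (not the chapter's procedure or its noise calibration) runs noisy gradient descent on the Huber loss: each sample's Huber gradient is bounded in norm by the robustification parameter `tau`, so the averaged gradient has l2 sensitivity `2*tau/n`, and Gaussian noise at that scale gives a per-iteration privacy guarantee (composition across iterations is ignored here for brevity).

```python
import numpy as np

def private_huber_mean(X, tau=1.0, eps=1.0, delta=1e-5, lr=0.5, iters=100, seed=0):
    """Huber mean estimator privatized by noisy gradient descent (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = np.zeros(d)
    # Gaussian-mechanism scale for a gradient with l2 sensitivity 2*tau/n
    sigma = (2 * tau / n) * np.sqrt(2 * np.log(1.25 / delta)) / eps
    for _ in range(iters):
        R = X - mu
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        # Huber gradient: residuals with norm above tau are shrunk to norm tau
        clipped = R * np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        grad = clipped.mean(axis=0)
        mu += lr * (grad + sigma * rng.standard_normal(d))
    return mu
```

Because every per-sample term is already bounded by `tau`, no extra clipping step is needed before adding privacy noise; robustness and privacy are served by the same mechanism.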
In Chapter 3, we consider nonparametric estimation of conditional expected shortfall functions. To mitigate the curse of dimensionality, we propose a two-step nonparametric ES estimator based on fully connected neural networks with the ReLU activation function.
This approach (i) involves unobservable surrogate response variables that must be estimated from data in a preliminary step, and (ii) uses a properly chosen Huber loss to achieve exponential deviation bounds under heavy-tailed response distributions.
Using a plug-in nonparametric conditional quantile estimate, also trained with deep neural networks, we establish non-asymptotic high-probability bounds for the final robust ES estimator; without resorting to any form of sample splitting, these bounds match those obtainable when the true quantile function is known.
We demonstrate the effectiveness of deep robust ES regression with both numerical experiments and an empirical study on the impact of El Niño on heavy precipitation, for which effective tail learning is imperative.
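To make the two-step construction concrete, here is a deliberately simplified numpy sketch: linear models fitted by (sub)gradient descent stand in for the deep ReLU networks, and ordinary least squares replaces the Huber regression in the second step. It exploits the identity that the surrogate response Z = q(X) + (Y - q(X)) * 1{Y <= q(X)} / alpha satisfies E[Z | X] = ES_alpha(X) when q is the conditional alpha-quantile.

```python
import numpy as np

def pinball_quantile_fit(X, y, alpha, lr=0.1, iters=5000):
    """Linear conditional alpha-quantile via subgradient descent on the pinball loss."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # add intercept column
    beta = np.zeros(d + 1)
    for _ in range(iters):
        u = y - Xb @ beta
        g = -Xb.T @ (alpha - (u < 0)) / n  # pinball-loss subgradient
        beta -= lr * g
    return beta

def two_step_es(X, y, alpha=0.1):
    """Two-step ES regression: (1) fit the conditional alpha-quantile,
    (2) regress the surrogate response on X by least squares."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    beta_q = pinball_quantile_fit(X, y, alpha)
    q = Xb @ beta_q
    # surrogate response whose conditional mean is the expected shortfall
    Z = q + (y - q) * (y <= q) / alpha
    beta_es, *_ = np.linalg.lstsq(Xb, Z, rcond=None)
    return beta_q, beta_es
```

The surrogate Z is observable only after the quantile step, which is why the quantile estimation error enters the analysis of the final ES estimator.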
In Chapter 4, we introduce a two-step nonparametric ES estimator that involves a plug-in quantile function estimate and requires no sample splitting. We provide non-asymptotic estimation and Gaussian approximation error bounds that depend explicitly on the effective dimension, sample size, regularization parameters, and quantile estimation error.
To construct pointwise confidence bands, we propose a fast multiplier bootstrap procedure and establish its validity.
We demonstrate the finite-sample performance of the proposed methods through numerical experiments
and an empirical study aimed at examining the heterogeneous effects of features on average and large medical expenses.
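As a schematic of the multiplier bootstrap idea for a simple mean-type statistic (not the chapter's procedure for ES regression), one perturbs the estimated influence values with i.i.d. standard normal multipliers and reads confidence limits off the quantiles of the perturbed deviations; the function name and inputs below are illustrative.

```python
import numpy as np

def multiplier_bootstrap_ci(psi, level=0.95, B=2000, seed=0):
    """Multiplier-bootstrap confidence interval for an estimator of the
    form mean(psi), where psi holds (estimated) influence values."""
    rng = np.random.default_rng(seed)
    n = psi.size
    est = psi.mean()
    W = rng.standard_normal((B, n))          # i.i.d. N(0,1) multipliers
    # bootstrap draws of the centered statistic: mean of w_i * (psi_i - est)
    dev = (W @ (psi - est)) / n
    lo, hi = np.quantile(dev, [(1 - level) / 2, (1 + level) / 2])
    return est - hi, est - lo
```

Only one pass over the data is needed to form `psi`; each bootstrap replicate is then a single matrix-vector product, which is what makes the procedure fast.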