Recent advances in technologies for cheaper and faster data acquisition and storage have led to an explosive growth of data complexity in a variety of scientific areas. As a result, noise accumulation, experimental variation, and data inhomogeneity have become substantial. Such settings pose serious statistical challenges for many classical classification and regression methods and hence call for new methods and theories.
This thesis is devoted to robust classification and regression algorithms with theoretical guarantees on important statistical properties. In Chapter 1, we present ArchBoost, a classification framework that applies to a wide range of loss functions, including nonconvex losses, and is specifically designed to be robust and efficient whenever the labels are recorded with error or the data are contaminated with outliers. In Chapter 2, we introduce a forest-type framework for regression problems and prove that many state-of-the-art forest algorithms belong to this framework. We then obtain robust forest-type regression methods by applying the proposed framework to robust loss functions. In Chapter 3, we design a novel estimating equation, motivated by the framework of Chapter 2, to solve the quantile regression problem on randomly censored data. In Chapter 4, we focus on high-dimensional left-censored quantile regression and study its inference problem. We modify the quantile loss to accommodate the left-censored nature of the data by extending the idea of redistribution of mass, and we carefully investigate the asymptotic properties of the resulting inference procedures. All the methods in the aforementioned chapters are evaluated through extensive numerical experiments on both simulated and real data sets.
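To make the two families of losses mentioned above concrete, the following is a minimal sketch of two standard examples: the Huber loss (a common robust loss, quadratic near zero and linear in the tails) and the quantile "check" loss. These are generic textbook formulations given for illustration, not necessarily the exact losses used in the thesis; the threshold `delta` and quantile level `tau` are assumed parameter names.

```python
def huber_loss(r, delta=1.0):
    # Robust loss: quadratic for |r| <= delta, linear beyond,
    # so large residuals (outliers) have bounded influence.
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

def quantile_loss(r, tau=0.5):
    # Check loss rho_tau(r) = r * (tau - 1{r < 0});
    # tau = 0.5 recovers half the absolute-error loss (median regression).
    return r * (tau - (1.0 if r < 0 else 0.0))
```

For example, `huber_loss(3.0, delta=1.0)` grows only linearly in the residual (here 2.5 rather than the squared-error value 4.5), which is what makes such losses robust to contamination; `quantile_loss` weights positive and negative residuals asymmetrically, which is what targets a specific conditional quantile.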