Topics in Clustering: Feature Selection and Semiparametric Modeling
- Pu, Xiao
- Advisor(s): Arias-Castro, Ery
Abstract
The first part of this thesis is concerned with sparse clustering, which assumes that, although a potentially large set of features is available for clustering the observations, the true underlying clusters differ only with respect to a few of them. We propose two approaches, both of which group the observations using only a carefully chosen subset of the features. The first approach assumes that the data are generated from a high-dimensional Gaussian mixture model in which the difference between the mean vectors of the Gaussian components is sparse. Motivated by the connection between sparse principal component analysis (SPCA) and sparse clustering, we adapt several estimation strategies from SPCA to perform sparse clustering. We provide theoretical guarantees for the aggregated estimator and develop an iterative algorithm to uncover the important feature set. The second approach is a hill-climbing method that alternates between selecting the s most important features (those with the s smallest within-cluster dissimilarities) and clustering the observations based on the selected feature subset. On simulated and real-world datasets, this approach is competitive with existing methods in the literature.
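The alternating scheme described above can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: it uses plain Lloyd's k-means for the clustering step, per-feature within-cluster sum of squares as the dissimilarity measure, and a fixed number of rounds; the function names and all tuning choices are hypothetical.

```python
import numpy as np

def kmeans(Z, k, rng, n_iter=25):
    # Plain Lloyd's algorithm on the current feature subset Z (n x |selected|).
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def hill_climb_sparse_clustering(X, k, s, n_rounds=5, seed=0):
    """Illustrative sketch: alternate between (i) clustering on the currently
    selected features and (ii) keeping the s features with the smallest
    within-cluster sum of squares (a stand-in for "within-cluster
    dissimilarity"). Features are standardized first so their within-cluster
    spreads are comparable."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # unit-variance features
    n, p = X.shape
    selected = np.arange(p)                     # start from all features
    labels = None
    for _ in range(n_rounds):
        labels = kmeans(X[:, selected], k, rng)
        wcss = np.zeros(p)                      # per-feature within-cluster SS
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj):
                wcss += ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=0)
        selected = np.sort(np.argsort(wcss)[:s])  # keep s "tightest" features
    return labels, selected
```

The point of the alternation is that each step improves the other's input: a better clustering sharpens the per-feature dissimilarities, and a better feature subset sharpens the clustering.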
In the second part of the thesis, we consider a semiparametric approach to clustering and develop the related theory. We first study the problem of fitting a mixture model under the assumption that the mixture components are symmetric and log-concave. We study the nonparametric maximum likelihood estimation (NPMLE) of a monotone and log-concave probability density, which arises as a subroutine of our algorithm, and derive results on the existence, uniqueness, and uniform consistency of the MLE. To fit the mixture model, we propose a semiparametric EM (SEM) algorithm, which can be adapted to other semiparametric mixture models. We then consider mixture modeling in high dimensions using radial (or elliptical) distributions. In working on this problem, we uncovered a difficulty in estimating the component densities: i.i.d. $d$-dimensional data points sampled from a rotationally invariant distribution with density $f(x) = g(\|x\|)$ concentrate near the boundary sphere of a $d$-dimensional ball as $d \to \infty$. This extends the well-known behavior of the normal distribution, whose samples concentrate around the sphere of radius equal to the square root of the dimension, to other radial densities. We establish a form of concentration of measure and, under additional assumptions, a convergence in distribution. We draw some consequences for statistical modeling in high dimensions, including a possible universality property of Gaussian mixtures.
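The concentration phenomenon can be checked empirically in the Gaussian special case mentioned above (the thesis's results cover general radial densities; this demo only illustrates the normal one). For standard Gaussian samples in $\mathbb{R}^d$, the norm has mean close to $\sqrt{d}$ and standard deviation close to $1/\sqrt{2}$ regardless of $d$, so the relative spread of the norm shrinks like $1/\sqrt{2d}$. The function name here is hypothetical.

```python
import numpy as np

def norm_concentration(d, n=2000, seed=0):
    """Sample n standard Gaussian points in R^d and return the mean and
    standard deviation of their Euclidean norms, both scaled by sqrt(d)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    r = np.linalg.norm(X, axis=1)
    return r.mean() / np.sqrt(d), r.std() / np.sqrt(d)

# As d grows, the scaled mean approaches 1 while the scaled spread
# shrinks roughly like 1/sqrt(2d): the mass concentrates on a thin shell.
for d in (10, 100, 1000):
    m, s = norm_concentration(d)
    print(d, round(m, 3), round(s, 3))
```

This is the behavior that complicates density estimation for radial models in high dimensions: almost no sample lands far from the shell, so $g$ is effectively observed only near one radius.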