New Powerful and Robust Graph-based Tests For High-dimensional and Non-Euclidean Data
- Zhu, Yejiong
- Advisor(s): Chen, Hao
Abstract
Two-sample hypothesis testing is a fundamental problem in statistics, but it faces challenges whendealing with high-dimensional and non-Euclidean data, which are increasingly common today. Parametric approaches are often limited by assumptions about specific distributions, and their power tends to decrease rapidly with increasing dimensionality unless very strong assumptions are made. This has led to a surge in nonparametric methods for handling modern data. Among them, graph-based methods have gained attention for their flexibility and strong performance across diverse alternatives. These methods utilize a similarity graph, such as the K-minimum spanning tree or the K-nearest neighbor graph, constructed on the observations, and can be applied to a dataset whenever a reasonable similarity measure is available. Despite the advancements of graph-based methods in recent years, there are still many unsolved issues within this framework. This dissertation contributes to advancing graph-based methods in several important aspects. Traditional graph-based methods rely on sparse similarity graphs, where the number of edges is proportional to the number of observations. We found that using denser graphs often substantially enhances testing power. To support this, we developed a new theoretical framework that establishes the limiting distribution of the test statistics under the permutation null distribution for a broader range of graphs, from sparse to dense. Even for sparse graphs, our new conditions are considerably weaker than existing ones, expanding the validity of the asymptotic null distribution of graph-based test statistics and ensuring fast type I error control for a much wider range of graphs. In addition to addressing graph density, we tackle the challenge posed by high-dimensional settings, where commonly used similarity graphs that have shown to be powerful for the graph-based tests often lead to formation of hubs – nodes with disproportionately large degrees – which can negatively impact the performance of graph-based methods. To address this, we propose a robust graph structure that reduces the size and frequency of hubs, thereby improving the effectiveness of graph-based methods in both two-sample testing and change-point detection. Furthermore, current graph-based methods rely primarily on similarity graphs, which capture only one aspect of the data. To exploit a broader range of information, we introduce a novel method that integrates both similarity and dissimilarity graphs. This new method demonstrates superior performance across a wider range of alternatives. We also derive the limiting distribution of the newly proposed test statistic under the permutation null distribution, ensuring fast and reliable type I error control.