Networks are abstract representations of relationships between a set of entities. As such they can be used to represent data in a variety of complex interactive systems such as people and their social connections, researchers and their collaborations, proteins and their interactions, and so on. Vast amounts of such interaction data are being collected routinely in a range of disciplines and thus call for the attention of the statistician. Due to their large size (number of observations scales as the square of the number of nodes), traditional statistical methods are usually not scalable and one needs to come up with more computationally feasible inference techniques.
A concrete example of the issue is the problem of community detection in networks. Traditional likelihood-based methods are computationally intractable, so researchers have come up with various computation-friendly alternatives. Although these methods work well on small to moderately large networks, most of them cannot handle truly large networks in a reasonable amount of time.
In this dissertation, we first advance divide and conquer strategies for community detection. We propose two algorithms which perform clustering on a number of small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they bring down significantly the computational cost of traditional algorithms, including spectral clustering, semidefinite programs, modularity based methods, likelihood based methods, etc., without losing on accuracy and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Thus, exploiting the facts that most traditional algorithms are accurate and the corresponding optimization problems are much simpler in small problems, our divide and conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real data analysis to understand the advantages of the divide and conquer approach in various settings.
We then extend these divide and conquer methods to the more realistic situation of mixed memberships. Models that can be tackled are the mixed membership blockmodel, topic models, etc.
Next we focus on the problem of network comparison. We tackle two aspects of this problem: clustering and changepoint detection.
While being able to cluster within a network, in the sense of community detection, is important, there are emerging needs to be able to \emph{cluster multiple networks}. This is largely motivated by the routine collection of network data that are generated from potentially different populations. These networks may or may not have node correspondence. For example, brain networks of a group of patients have node correspondence, whereas collaboration networks of researchers in different disciplines such as Computer Science, Mathematics or Statistics will have little node correspondence. When node correspondence is present, we cluster networks by summarizing a network by its graphon estimate, whereas when node correspondence is not present, we propose a novel solution for clustering such networks by associating a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix. We illustrate our methods using both simulated and real data sets, and theoretical justifications are provided in terms of consistency.
In the changepoint problem, one observes a series of networks indexed by time and wishes the check if there is some significant change in the structure of these networks at some point of time. Potential applications are in, for instance, brain imaging, where one has brain scans of individuals collected over time and is looking for abnormalities, ecological networks observed over time, where one wonders if there is a structural change. We consider a CUSUM (short for cumulative sum) statistic for this problem, and prove its consistency. We find that in this high dimensional setting, the estimation error rate is better than the classical rate for fixed dimensional changepoint problems. As applications, we detect changepoints in the MIT reality mining data and the US senate roll call data.