This dissertation aims to address the problem of estimating the number of communities in count-weighted networks. We propose two methods from very different perspectives.
The first method we propose is a stepwise procedure, referred to as Stepwise Model-Assisted Spectral Thresholding (SMAST), selecting the number of communities in general count-weighted networks, which are common in the real-world. In the $m$-th step of the procedure, we first cluster the nodes into $m$ groups with a certain spectral clustering method, and estimate the mean parameters of the general degree-corrected stochastic blockmodel (DCSBM). Then the adjacency matrix is normalized by the estimated degree-correction parameters as well as the solution to a small-scale matrix scaling problem. The eigenvalues of the resultant normalized adjacency matrix are then truncated with the threshold $(2+\epsilon)\sqrt{n}$ in magnitude, where $\epsilon$ is a prespecified small constant, e.g. $0.05$. The procedure continues to the $(m+1)$-th step if the number of remaining significant eigenvalues is greater than $m$, and stops otherwise.A prominent feature of this method is that it is derived under the general DCSBM, and can be applied to general count-weighted networks. In theory, if SCORE is used for spectral clustering, SMAST is shown to be consistent in estimating the true number of communities under certain assumptions that hold for a broad class of count-weighted networks. An extension of the Nonsplitting Property of SCORE to the general DCSBM, and recent results on spectral radii of inhomogeneous random matrices, play essential roles in the analyses. Extensive numerical experiments on simulated and real networks have also been conducted to demonstrate the empirical effectiveness of SMAST.
The second method is a network rank selection method based on approximate risk estimation. We treat the problem of estimating the number of communities in a network as selecting the rank of the expected adjacency matrix.For each given rank $r$, the spectral estimate of the expected adjacency matrix is obtained by spectral truncation of the observed adjacency matrix with rank $r$. The goodness of rank $r$ is evaluated by quantifying the estimation error of the estimated expected adjacency matrix. By choosing the error measure based on the binomial deviance, following existing work in the literature on risk estimation under the Bernoulli model, we can estimate the optimism of the apparent error over the true error by the divergence of the estimator with respect to the observation. Moreover, the divergence formula of any spectral function can be calculated in closed-form, which provides an estimate of the true error as the apparent error plus the divergence.
This framework can be straightforwardly extended to Poisson networks, and thus can be used to solve the problem of rank selection for count-weighted networks. The closed-form formula of the estimated true error is derived in this case as well.
We have applied this method to several benchmark networks in the literature and found the performance comparable and even better than other state-of-the-art model-based approaches.