- Main
Bipartite Network Community Detection: Development and Survey of Algorithmic and Stochastic Block Model Based Methods
- Sun, Yidan
- Advisor(s): Li, Jingyi
Abstract
In a bipartite network, nodes are divided into two types, and edges are only allowed to connect nodes of different types. Bipartite network clustering problems aim to identify node groups with more edges between themselves and fewer edges to the rest of the network. The approaches for community detection in the bipartite network can roughly be classified into algorithmic and model-based methods. The algorithmic methods solve the problem either by greedy searches in a heuristic way or optimizing based on some criteria over all possible partitions. The model-based methods fit a generative model to the observed data and study the model in a statistically principled way. In this dissertation, we mainly focus on bipartite clustering under two scenarios: incorporation of node covariates and detection of mixed membership communities.
In Chapter 2, we study an algorithmic bipartite clustering method in the context of gene co-clustering. Gene co-clustering is a widely-used technique that has enabled computational prediction of unknown gene functions within a species. However, it remains a challenge to refine gene function prediction by leveraging evolutionarily conserved genes in another species. This challenge calls for a new computational algorithm to identify gene co-clusters in two species, so that genes in each co-cluster exhibit similar expression levels in each species and strong conservation between the species. Here we develop the bipartite tight spectral clustering (BiTSC) algorithm, which identifies gene co-clusters in two species based on gene orthology information and gene expression data. BiTSC novelly implements a formulation that encodes gene orthology as a bipartite network and gene expression data as node covariates. This formulation allows BiTSC to adopt and combine the advantages of multiple unsupervised learning techniques: kernel enhancement, bipartite spectral clustering, consensus clustering, tight clustering, and hierarchical clustering. As a result, BiTSC is a flexible and robust algorithm capable of identifying informative gene co-clusters without forcing all genes into co-clusters. Another advantage of BiTSC is that it does not rely on any distributional assumptions. Beyond cross-species gene co-clustering, BiTSC also has wide applications as a general algorithm for identifying tight node co-clusters in any bipartite network with node covariates. We demonstrate the accuracy and robustness of BiTSC through comprehensive simulation studies. In a real data example, we use BiTSC to identify conserved gene co-clusters of D. melanogaster and C. elegans, and we perform a series of downstream analyses to both validate BiTSC and verify the biological significance of the identified co-clusters.
The stochastic block model (SBM) serves as a fundamental and classic tool to study model-based community detection in a statistically principled way. It is a simple generative model but too restrictive to include empirical network characteristics. With the flexibility of the SBM, many SBM variants have been proposed to accommodate the real-world scenarios, such as node degree heterogeneity, mixed membership communities, and bipartite structure. In Chapter 3, we review a few extensions of the SBM, including degree-correction SBM, mixed membership SBM, and bipartite SBM. We conduct a survey about community detection methods based on the mixed membership SBM and bipartite SBM. We discuss these methods in detail and summarize the challenges in introducing degree-correction to the mixed membership SBM; the challenges in extending the degree-corrected mixed membership SBM to the bipartite network.
To address some of the research gaps discussed in Chapter 3, we construct a generative bipartite degree-corrected mixed membership SBM in Chapter 4. To fit the model, we propose a variational expectation-maximization (EM) algorithm. We perform a preliminary simulation study in which the result suggests that the algorithm is sensitive to the initial values. We discuss the possible improvements for future directions.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-