Semi-supervised Community Detection via Structural Similarity Metrics

Authors: Yicong Jiang, Tracy Ke

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the performance of Angle Min+, where Π̂_U is obtained from SCORE+ (Jin et al., 2021). We compare our methods with SNMF (Yang et al., 2015), a representative of semi-supervised approaches, and SCORE+, a fully unsupervised approach. We also compare our algorithm with typical GNN methods (Kipf & Welling, 2016) on the real data. Simulations: To illustrate how the information in A_UU improves classification accuracy, we consider Angle Min in (4) in the simulations. Also, to shed light on how information from unlabeled data improves classification accuracy, we consider a special version of Angle Min+ that is fed only A_LL and X_L. It ignores the unlabeled data and uses only the subnetwork consisting of labeled nodes. We call it Angle Min+(subnetwork). This method is practically uninteresting, but it serves as a representative of the fully supervised approach that ignores unlabeled nodes. We simulate data from the DCBM with (n, K) = (500, 3). Our results are presented in Figure 1, which shows the average classification error of each algorithm as the number of labeled nodes n_L increases. The plots indicate that Angle Min+ outperforms the other methods in all cases. Real data: We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005).
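The evaluation protocol quoted above (average classification error over repeated simulations as the number of labeled nodes grows) can be sketched as follows. This is a minimal sketch, not the authors' released code; the method callables and the simulate_network helper are hypothetical placeholders for whichever implementations are available.

import numpy as np

def misclassification_rate(y_true, y_pred):
    # Fraction of (unlabeled) nodes assigned to the wrong community.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def average_error_curve(methods, simulate_network, n_labeled_grid, n_reps=100):
    # methods: dict mapping a method name to a callable
    #   (A, labeled_idx, labels_of_labeled) -> predicted labels for the unlabeled nodes.
    # simulate_network: callable returning (A, true_labels); both are placeholders.
    errors = {name: [] for name in methods}
    rng = np.random.default_rng(0)
    for n_L in n_labeled_grid:
        per_rep = {name: [] for name in methods}
        for _ in range(n_reps):
            A, y = simulate_network()
            labeled_idx = rng.choice(len(y), size=n_L, replace=False)
            unlabeled_idx = np.setdiff1d(np.arange(len(y)), labeled_idx)
            for name, fit in methods.items():
                y_hat = fit(A, labeled_idx, y[labeled_idx])
                per_rep[name].append(misclassification_rate(y[unlabeled_idx], y_hat))
        for name in methods:
            errors[name].append(np.mean(per_rep[name]))
    return errors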
Researcher Affiliation | Academia | Yicong Jiang, Tracy Ke, Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Pseudocode | Yes | A PSEUDO CODE OF THE ALGORITHM: Below is the pseudocode of Angle Min+, which is deferred to the appendix due to the page limit. Algorithm 1: Angle Min+
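The paper defers Algorithm 1 to its appendix, so it is not reproduced here. Purely to illustrate the general angle-minimization idea (assign each unlabeled node to the community whose labeled block it points to at the smallest angle), the following hedged sketch may help; the edge-count similarity used below is an assumption and is not the paper's structural similarity metric or its Algorithm 1.

import numpy as np

def angle_min_assign(A, labeled_idx, labels, K):
    # A: (n, n) symmetric adjacency matrix; labeled_idx: indices of labeled nodes;
    # labels: community labels in {0, ..., K-1} for those labeled nodes.
    labeled_idx = np.asarray(labeled_idx)
    labels = np.asarray(labels)
    n = A.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    # Edge-count profile of every node towards each labeled community
    # (an assumed stand-in for the paper's similarity metric).
    profiles = np.zeros((n, K))
    for k in range(K):
        members = labeled_idx[labels == k]
        profiles[:, k] = A[:, members].sum(axis=1)

    # Reference direction per community: mean profile of its labeled nodes.
    centroids = np.vstack([profiles[labeled_idx[labels == k]].mean(axis=0)
                           for k in range(K)])

    # Maximizing cosine similarity is equivalent to minimizing the angle.
    eps = 1e-12
    u = profiles[unlabeled_idx]
    u = u / (np.linalg.norm(u, axis=1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    return (u @ c.T).argmax(axis=1)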
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: All the code is available in the supplementary materials.
Open Datasets | Yes | We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005). For a fairer comparison, we also consider a real network, Citeseer (Sen et al., 2008), that contains node features.
Dataset Splits | Yes | For each dataset, we separate the nodes into 10 folds and treat each fold as the test data in turn, with the other 9 folds as training data.
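A minimal sketch of this node-level 10-fold protocol, combined with the random labeling of training nodes described under Experiment Setup below; the function name and the scikit-learn dependency are choices made here, not specified by the paper.

import numpy as np
from sklearn.model_selection import KFold

def ten_fold_node_splits(n_nodes, label_fraction, seed=0):
    # Yields (labeled_idx, unlabeled_train_idx, test_idx) for each of the 10 folds.
    # Each fold serves as test data once; within the remaining 9 folds (the
    # training network), a label_fraction share of nodes is treated as labeled.
    rng = np.random.default_rng(seed)
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n_nodes)):
        n_L = int(round(label_fraction * len(train_idx)))
        labeled_idx = rng.choice(train_idx, size=n_L, replace=False)
        unlabeled_train_idx = np.setdiff1d(train_idx, labeled_idx)
        yield labeled_idx, unlabeled_train_idx, test_idx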
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used.
Experiment Setup | Yes | Simulations: ... We simulate data from the DCBM with (n, K) = (500, 3). To generate P, we draw its off-diagonal entries from Uniform(0, 1) and then symmetrize it. We generate the degree heterogeneity parameters θ_i i.i.d. from one of the following 4 distributions: n^{-0.5}·√(log n)·Gamma(3.5), n^{-0.25}·Gamma(3.5), n^{-0.5}·√(log n)·Pareto(3.5), n^{-0.25}·Pareto(3.5). ... We consider two cases of Π: the balanced case (bal.) and the imbalanced case (inbal.). In the former, π(i) are i.i.d. from Multinomial(1/3, 1/3, 1/3), and in the latter, π(i) are i.i.d. from Multinomial(0.2, 0.2, 0.6). We repeat the simulation 100 times. ... Real data: For each dataset, we separate the nodes into 10 folds and treat each fold as the test data in turn, with the other 9 folds as training data. In the training network, we randomly choose n_L nodes as labeled nodes. We then estimate the label of each node in the test data and report the misclassification error rate (averaged over 10 folds). We consider n_L/n ∈ {0.3, 0.5, 0.7}, where n is the number of nodes in the training data.
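The quoted simulation settings can be turned into a short data-generating sketch. This is a hedged reconstruction rather than the authors' code: the unit diagonal of P, the exact sparsity scaling of θ_i, and the Pareto parameterization are assumptions, and only one θ scaling per distribution family is shown.

import numpy as np

def simulate_dcbm(n=500, K=3, balanced=True, theta_dist="gamma", seed=0):
    # Draw one network from a degree-corrected block model (DCBM).
    rng = np.random.default_rng(seed)

    # Community mixing matrix P: Uniform(0, 1) off-diagonal entries, symmetrized;
    # the unit diagonal is an assumption, not stated in the quoted setup.
    P = rng.uniform(0, 1, size=(K, K))
    P = (P + P.T) / 2
    np.fill_diagonal(P, 1.0)

    # Degree heterogeneity parameters θ_i; the n^{-0.5}·√(log n) scaling is the
    # reconstructed sparsity factor from the quoted setup.
    scale = n ** (-0.5) * np.sqrt(np.log(n))
    if theta_dist == "gamma":
        theta = scale * rng.gamma(3.5, size=n)
    else:
        # numpy's pareto draws from the Lomax form; +1 gives a classical Pareto(3.5).
        theta = scale * (rng.pareto(3.5, size=n) + 1)

    # Community labels: balanced (1/3 each) or imbalanced (0.2, 0.2, 0.6).
    probs = np.full(K, 1.0 / K) if balanced else np.array([0.2, 0.2, 0.6])
    labels = rng.choice(K, size=n, p=probs)

    # Bernoulli adjacency matrix: P(A_ij = 1) = θ_i θ_j P[k_i, k_j], clipped to [0, 1],
    # symmetric and without self-loops.
    Omega = np.clip(np.outer(theta, theta) * P[labels][:, labels], 0, 1)
    upper = np.triu(rng.random((n, n)) < Omega, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels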