Semi-supervised Community Detection via Structural Similarity Metrics

Authors: Yicong Jiang, Tracy Ke

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the performance of Angle Min+, where Π̂_U is obtained from SCORE+ (Jin et al., 2021). We compare our methods with SNMF (Yang et al., 2015), a representative of semi-supervised approaches, and SCORE+, a fully unsupervised approach. We also compare our algorithm with typical GNN methods (Kipf & Welling, 2016) on the real data. Simulations: To illustrate how the information in A_UU improves classification accuracy, we consider Angle Min in (4) in the simulations. Also, to shed light on how information from unlabeled data improves classification accuracy, we consider a special version of Angle Min+ that is fed only A_LL and X_L. It ignores the unlabeled data and uses only the subnetwork consisting of labeled nodes. We call it Angle Min+(subnetwork). This method is practically uninteresting, but it serves as a representative of the fully supervised approach that ignores unlabeled nodes. We simulate data from the DCBM with (n, K) = (500, 3). Our results are presented in Figure 1, which shows the average classification error of each algorithm as the number of labeled nodes n_L increases. The plots indicate that Angle Min+ outperforms the other methods in all cases. Real data: We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005).
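The evaluation protocol quoted above (average classification error over repeated simulations as the number of labeled nodes grows) can be sketched as follows. This is a minimal sketch, not the authors' released code; the method callables and the simulate_network helper are hypothetical placeholders for whichever implementations are available.

import numpy as np

def misclassification_rate(y_true, y_pred):
    # Fraction of (unlabeled) nodes assigned to the wrong community.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def average_error_curve(methods, simulate_network, n_labeled_grid, n_reps=100):
    # methods: dict mapping a method name to a callable
    #   (A, labeled_idx, labels_of_labeled) -> predicted labels for the unlabeled nodes.
    # simulate_network: callable returning (A, true_labels); both are placeholders.
    errors = {name: [] for name in methods}
    rng = np.random.default_rng(0)
    for n_L in n_labeled_grid:
        per_rep = {name: [] for name in methods}
        for _ in range(n_reps):
            A, y = simulate_network()
            labeled_idx = rng.choice(len(y), size=n_L, replace=False)
            unlabeled_idx = np.setdiff1d(np.arange(len(y)), labeled_idx)
            for name, fit in methods.items():
                y_hat = fit(A, labeled_idx, y[labeled_idx])
                per_rep[name].append(misclassification_rate(y[unlabeled_idx], y_hat))
        for name in methods:
            errors[name].append(np.mean(per_rep[name]))
    return errors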
Researcher Affiliation | Academia | Yicong Jiang, Tracy Ke, Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Pseudocode | Yes | A PSEUDO CODE OF THE ALGORITHM: Below is the pseudocode of Angle Min+, which is deferred to the appendix due to the page limit. Algorithm 1: Angle Min+
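The paper defers Algorithm 1 to its appendix, so it is not reproduced here. Purely to illustrate the general angle-minimization idea (assign each unlabeled node to the community whose labeled block it points to at the smallest angle), the following hedged sketch may help; the edge-count similarity used below is an assumption and is not the paper's structural similarity metric or its Algorithm 1.

import numpy as np

def angle_min_assign(A, labeled_idx, labels, K):
    # A: (n, n) symmetric adjacency matrix; labeled_idx: indices of labeled nodes;
    # labels: community labels in {0, ..., K-1} for those labeled nodes.
    labeled_idx = np.asarray(labeled_idx)
    labels = np.asarray(labels)
    n = A.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    # Edge-count profile of every node towards each labeled community
    # (an assumed stand-in for the paper's similarity metric).
    profiles = np.zeros((n, K))
    for k in range(K):
        members = labeled_idx[labels == k]
        profiles[:, k] = A[:, members].sum(axis=1)

    # Reference direction per community: mean profile of its labeled nodes.
    centroids = np.vstack([profiles[labeled_idx[labels == k]].mean(axis=0)
                           for k in range(K)])

    # Maximizing cosine similarity is equivalent to minimizing the angle.
    eps = 1e-12
    u = profiles[unlabeled_idx]
    u = u / (np.linalg.norm(u, axis=1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    return (u @ c.T).argmax(axis=1)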
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: All the code is available in the supplementary materials.
Open Datasets | Yes | We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005). For a fairer comparison, we also consider a real network, Citeseer (Sen et al., 2008), that contains node features.
Dataset Splits | Yes | For each dataset, we separate the nodes into 10 folds and treat each fold as the test data in turn, with the other 9 folds as training data.
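A minimal sketch of this node-level 10-fold protocol, combined with the random labeling of training nodes described under Experiment Setup below; the function name and the scikit-learn dependency are choices made here, not specified by the paper.

import numpy as np
from sklearn.model_selection import KFold

def ten_fold_node_splits(n_nodes, label_fraction, seed=0):
    # Yields (labeled_idx, unlabeled_train_idx, test_idx) for each of the 10 folds.
    # Each fold serves as test data once; within the remaining 9 folds (the
    # training network), a label_fraction share of nodes is treated as labeled.
    rng = np.random.default_rng(seed)
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n_nodes)):
        n_L = int(round(label_fraction * len(train_idx)))
        labeled_idx = rng.choice(train_idx, size=n_L, replace=False)
        unlabeled_train_idx = np.setdiff1d(train_idx, labeled_idx)
        yield labeled_idx, unlabeled_train_idx, test_idx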
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used.
Experiment Setup | Yes | Simulations: ... We simulate data from the DCBM with (n, K) = (500, 3). To generate P, we draw its off-diagonal entries from Uniform(0, 1) and then symmetrize it. We generate the degree heterogeneity parameters θ_i i.i.d. from one of the following 4 distributions: n^{-0.5}·√(log n)·Gamma(3.5), n^{-0.25}·Gamma(3.5), n^{-0.5}·√(log n)·Pareto(3.5), n^{-0.25}·Pareto(3.5). ... We consider two cases of Π: the balanced case (bal.) and the imbalanced case (inbal.). In the former, π(i) are i.i.d. from Multinomial(1/3, 1/3, 1/3), and in the latter, π(i) are i.i.d. from Multinomial(0.2, 0.2, 0.6). We repeat the simulation 100 times. ... Real data: For each dataset, we separate the nodes into 10 folds and treat each fold as the test data in turn, with the other 9 folds as training data. In the training network, we randomly choose n_L nodes as labeled nodes. We then estimate the label of each node in the test data and report the misclassification error rate (averaged over 10 folds). We consider n_L/n ∈ {0.3, 0.5, 0.7}, where n is the number of nodes in the training data.
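The quoted simulation settings can be turned into a short data-generating sketch. This is a hedged reconstruction rather than the authors' code: the unit diagonal of P, the exact sparsity scaling of θ_i, and the Pareto parameterization are assumptions, and only one θ scaling per distribution family is shown.

import numpy as np

def simulate_dcbm(n=500, K=3, balanced=True, theta_dist="gamma", seed=0):
    # Draw one network from a degree-corrected block model (DCBM).
    rng = np.random.default_rng(seed)

    # Community mixing matrix P: Uniform(0, 1) off-diagonal entries, symmetrized;
    # the unit diagonal is an assumption, not stated in the quoted setup.
    P = rng.uniform(0, 1, size=(K, K))
    P = (P + P.T) / 2
    np.fill_diagonal(P, 1.0)

    # Degree heterogeneity parameters θ_i; the n^{-0.5}·√(log n) scaling is the
    # reconstructed sparsity factor from the quoted setup.
    scale = n ** (-0.5) * np.sqrt(np.log(n))
    if theta_dist == "gamma":
        theta = scale * rng.gamma(3.5, size=n)
    else:
        # numpy's pareto draws from the Lomax form; +1 gives a classical Pareto(3.5).
        theta = scale * (rng.pareto(3.5, size=n) + 1)

    # Community labels: balanced (1/3 each) or imbalanced (0.2, 0.2, 0.6).
    probs = np.full(K, 1.0 / K) if balanced else np.array([0.2, 0.2, 0.6])
    labels = rng.choice(K, size=n, p=probs)

    # Bernoulli adjacency matrix: P(A_ij = 1) = θ_i θ_j P[k_i, k_j], clipped to [0, 1],
    # symmetric and without self-loops.
    Omega = np.clip(np.outer(theta, theta) * P[labels][:, labels], 0, 1)
    upper = np.triu(rng.random((n, n)) < Omega, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels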