Community Detection Guarantees using Embeddings Learned by Node2Vec

Authors: Andrew Davison, S. Carlyle Morgan, Owen G. Ward

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our main result shows that the use of k-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree corrected) stochastic block models. We demonstrate this result empirically for both real and simulated networks, and examine how this relates to other embedding tools and machine learning procedures for network data.
Researcher Affiliation Academia Andrew Davison Department of Statistics Columbia University New York, NY 10027 ad3395@columbia.eduS. Carlyle Morgan Department of Statistics University of Michigan Ann Arbor, MA 48109 scmorgan@umich.eduOwen G. Ward Department of Statistics and Actuarial Science Simon Fraser University Burnaby, British Columbia owen_ward@sfu.ca
Pseudocode No The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code Yes All code required to reproduce all results is included in the code repository in the supplemental files.
Open Datasets Yes We first simulate data from the planted partition stochastic block model, SBM(n/κ, κ, p, q, ρn). We use two publicly available networks containing known community structure. We first consider a network of emails between 1005 members of a large research institution, available as part of the Stanford Network Analysis Project [31]. We also consider a widely used dataset of directed edges between 1490 U.S political blogs, collected before the 2004 elections [2].
Dataset Splits Yes We then use the true community labels of 10% of these nodes to train a (multinomial) logistic regression classifier, and predict the class label for the remaining 90% of nodes in the network.
Hardware Specification Yes All experiments were run on a computing cluster utilising 4 cores of an Intel E5-2683 v4 Broadwell 2.1GHz CPU or similar with 2 GB of memory per core.
Software Dependencies No The paper mentions using 'node2vec' implementation but does not specify its version number or any other software with specific versions.
Experiment Setup Yes We use an embedding dimension of 64 and do not modify other default tuning parameters for the embedding procedure unless specified, so that p = q = 1. We investigate the role of these tuning parameters below, allowing them to vary as is considered in node2vec. We pass these embedding vectors into k-means clustering, where k = κ, the true number of communities present in the network.