Local Graph Clustering with Noisy Labels

Authors: Artur Back de Luca, Kimon Fountoulakis, Shenghao Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.
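The exact weighting scheme, equation (4), is not reproduced on this page. Below is a minimal sketch of constructing the weighted graph G_w, assuming the convention that edges whose two endpoints both carry the positive noisy label keep weight 1 while every other edge is downweighted to ϵ; this convention is an assumption, not a quote from the paper.

```python
# Minimal sketch of building the label-weighted graph G_w.
# ASSUMPTION: equation (4) is not quoted on this page; here, edges whose
# two endpoints both carry the positive noisy label keep weight 1 and
# all remaining edges are downweighted to eps.

def build_weighted_graph(edges, noisy_labels, eps=0.05):
    """edges: iterable of (i, j) pairs; noisy_labels: array of 0/1 labels."""
    weights = {}
    for i, j in edges:
        if noisy_labels[i] == 1 and noisy_labels[j] == 1:
            weights[(i, j)] = 1.0  # edge inside the predicted cluster
        else:
            weights[(i, j)] = eps  # edge touching a predicted non-member
    return weights
```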
Researcher Affiliation | Academia | Artur Back de Luca, Kimon Fountoulakis, Shenghao Yang; School of Computer Science, University of Waterloo, Canada; {abackdel,kimon.fountoulakis,shenghao.yang}@uwaterloo.ca
Pseudocode | No | The paper refers to 'Algorithm 1 from Yang & Fountoulakis (2023)' but does not provide its own pseudocode or algorithm block.
Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository.
Open Datasets | Yes | We carry out experiments over the following 6 real-world attributed graph datasets. We include all 3 datasets used in Yang & Fountoulakis (2023), namely Amazon Photo (McAuley et al., 2015), Coauthor CS, and Coauthor Physics (Shchur et al., 2018), to ensure compatibility of results. Additionally, we use 3 well-established graph machine learning benchmarks: Amazon Computers (Shchur et al., 2018), Cora (McCallum et al., 2000), and Pubmed (Sen et al., 2008).
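All six benchmarks are publicly available. The paper does not name a loading pipeline; as one possibility, they can be fetched through PyTorch Geometric, as in the sketch below (the root paths are illustrative).

```python
# Hypothetical loading of the six benchmarks via PyTorch Geometric;
# the paper does not state which library or storage paths it used.
from torch_geometric.datasets import Amazon, Coauthor, Planetoid

datasets = {
    "Amazon Photo":     Amazon(root="data/amazon", name="Photo"),
    "Amazon Computers": Amazon(root="data/amazon", name="Computers"),
    "Coauthor CS":      Coauthor(root="data/coauthor", name="CS"),
    "Coauthor Physics": Coauthor(root="data/coauthor", name="Physics"),
    "Cora":             Planetoid(root="data/planetoid", name="Cora"),
    "Pubmed":           Planetoid(root="data/planetoid", name="PubMed"),
}
```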
Dataset Splits | Yes | We divide the experiments into two settings. In the first, we assume access to a selected number of ground-truth labels, evenly sampled from the target and non-target classes. These nodes are used to train a classifier (without graph information), and the classifier's predictions serve as noisy labels to construct the weighted graph G_w as defined in (4), with ϵ = 0.05 as in the synthetic experiments. All positive nodes, i.e. nodes that belong to the target cluster according to the given ground-truth labels, are used as seed nodes during the diffusion process. For each cluster in each dataset, we compare LFD against FD and WFD over 100 trials; in each trial, a classifier is trained on randomly sampled positive and negative nodes, which we treat as ground-truth information.
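A minimal sketch of one trial in this first setting, assuming a scikit-learn logistic regression trained on node features alone (the classifier uses no graph information; the function and variable names are ours, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_trial(features, true_labels, n_per_class, rng):
    """One trial: sample balanced ground-truth labels, train a
    feature-only classifier, and return its predictions as noisy labels."""
    pos = np.flatnonzero(true_labels == 1)
    neg = np.flatnonzero(true_labels == 0)
    train = np.concatenate([
        rng.choice(pos, n_per_class, replace=False),  # sampled target nodes
        rng.choice(neg, n_per_class, replace=False),  # sampled non-target nodes
    ])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[train], true_labels[train])
    noisy_labels = clf.predict(features)  # noisy labels used to build G_w
    seeds = train[:n_per_class]           # all sampled positive nodes serve as seeds
    return noisy_labels, seeds

# Example usage: noisy, seeds = run_trial(X, y, 50, np.random.default_rng(0))
```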
Hardware Specification | Yes | The experiments are run on an Intel i9-13900K CPU with 36 MB cache and 2 × 48 GB DDR5 RAM.
Software Dependencies | No | The paper mentions software like a 'Logistic Regression model' and 'GCN architecture' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For the edge weights (4) in G_w we set ϵ = 0.05. We set the sink capacity T_i = 1 for all nodes. For the source mass at the seed node, we set it to αk for α = 2, 2.25, ..., 4, out of which we select the value that yields the highest F1 score based on supp(x).
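The source-mass sweep can be sketched as below. Here weighted_flow_diffusion is a hypothetical stand-in for Algorithm 1 of Yang & Fountoulakis (2023), which the paper reuses but this page does not restate, and k is assumed to denote the target cluster size (the summary only writes αk).

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_alpha(graph_w, seed, k, target_mask):
    """Try each source mass alpha * k and keep the support with the best F1.

    ASSUMPTION: weighted_flow_diffusion is a hypothetical placeholder for
    Algorithm 1 of Yang & Fountoulakis (2023); sink capacity T_i = 1 at
    every node, as stated in the setup above.
    """
    best_f1, best_support = -1.0, None
    for alpha in np.arange(2.0, 4.0 + 1e-9, 0.25):  # alpha = 2, 2.25, ..., 4
        x = weighted_flow_diffusion(graph_w, source={seed: alpha * k},
                                    sink_capacity=1.0)
        support = x > 0                              # recovered cluster: supp(x)
        score = f1_score(target_mask, support)
        if score > best_f1:
            best_f1, best_support = score, support
    return best_support, best_f1
```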