Local Graph Clustering with Noisy Labels
Authors: Artur Back de Luca, Kimon Fountoulakis, Shenghao Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%. |
| Researcher Affiliation | Academia | Artur Back de Luca, Kimon Fountoulakis, Shenghao Yang; School of Computer Science, University of Waterloo, Canada; {abackdel,kimon.fountoulakis,shenghao.yang}@uwaterloo.ca |
| Pseudocode | No | The paper refers to 'Algorithm 1 from Yang & Fountoulakis (2023)' but does not provide its own pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository. |
| Open Datasets | Yes | We carry out experiments over the following 6 real-world attributed graph datasets. We include all 3 datasets used in Yang & Fountoulakis (2023), namely Amazon Photo (McAuley et al., 2015), Coauthor CS, and Coauthor Physics (Shchur et al., 2018), to ensure compatibility of results. Additionally, we use 3 well-established graph machine learning benchmarks: Amazon Computers (Shchur et al., 2018), Cora (McCallum et al., 2000) and Pubmed (Sen et al., 2008). |
| Dataset Splits | Yes | We divide the experiments into two settings. In the first, we assume access to a selected number of ground-truth labels, evenly sampled from both the target and non-target classes. These nodes are utilized to train a classifier (without graph information). The predictions of the classifier are then used as noisy labels to construct the weighted graph Gw as defined in (4), and we set ϵ = 0.05 as in the synthetic experiments. We use all the positive nodes, i.e. nodes that belong to the target cluster based on the given ground-truth labels, as seed nodes during the diffusion process. For each cluster in each dataset, we compare LFD against FD and WFD over 100 trials. For each trial, a classifier is trained using randomly sampled positive and negative nodes which we treat as ground-truth information. |
| Hardware Specification | Yes | The experiments are run on an Intel i9-13900K CPU with 36MB Cache and 2 x 48GB DDR5 RAM. |
| Software Dependencies | No | The paper mentions software like 'Logistic Regression model' and 'GCN architecture' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For edge weights (4) in Gw we set ϵ = 0.05. We set the sink capacity Ti = 1 for all nodes. For the source mass at the seed node, we set it to αk for α = 2, 2.25, ..., 4, out of which we select the one that results in the highest F1 score based on supp(x*). |
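
For readers attempting to reproduce the pipeline described in the "Experiment Setup" row above, the following is a minimal sketch of the core primitive: building the label-weighted graph Gw and running flow diffusion from a seed set. The weighting rule (weight 1 for edges whose endpoints share a noisy label, ϵ otherwise) matches the paper's description of equation (4); the push-based ℓ2 flow diffusion is the generic formulation from the flow-diffusion literature (Yang & Fountoulakis, 2023), not the authors' released code. The function names, the adjacency-list representation, and the even split of source mass across seeds are our assumptions.

```python
import numpy as np

def build_weighted_graph(edges, noisy_labels, eps=0.05):
    """Adjacency list with edge weights: 1 if the two endpoints share a
    noisy label, eps otherwise (the weighting rule described for Eq. (4))."""
    adj = {}
    for i, j in edges:
        w = 1.0 if noisy_labels[i] == noisy_labels[j] else eps
        adj.setdefault(i, []).append((j, w))
        adj.setdefault(j, []).append((i, w))
    return adj

def flow_diffusion(adj, seeds, source_mass, sink_capacity=1.0, max_iters=10**6):
    """Push-based l2 flow diffusion: place source mass on the seeds
    (split evenly -- an assumption), let every node absorb up to
    `sink_capacity` (T_i = 1 in the paper), and repeatedly push excess
    mass to neighbors in proportion to edge weights. Returns the node
    embedding x; supp(x) is the recovered cluster."""
    mass = {s: source_mass / len(seeds) for s in seeds}
    x = {}
    active = [v for v, m in mass.items() if m > sink_capacity]
    for _ in range(max_iters):
        if not active:
            break  # no node holds more mass than it can absorb
        v = active.pop()
        excess = mass[v] - sink_capacity
        if excess <= 0:
            continue  # stale entry: excess already pushed out
        wdeg = sum(w for _, w in adj[v])
        x[v] = x.get(v, 0.0) + excess / wdeg
        mass[v] = sink_capacity
        for u, w in adj[v]:
            mass[u] = mass.get(u, 0.0) + excess * w / wdeg
            if mass[u] > sink_capacity:
                active.append(u)
    return x
```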
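Building on the helpers above, the sketch below mirrors one trial of the first experimental setting quoted in the "Dataset Splits" row: train an attribute-only classifier on a few evenly sampled ground-truth labels, use its predictions as noisy labels for Gw (ϵ = 0.05), seed the diffusion with the known positive nodes, and keep the α in {2, 2.25, ..., 4} whose supp(x) gives the best F1. `run_trial`, the use of scikit-learn's LogisticRegression, and treating the target-cluster size k as known are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def run_trial(X, edges, y_target, labeled_idx, eps=0.05,
              alphas=np.arange(2.0, 4.25, 0.25)):
    """One trial of label-based flow diffusion (LFD), under the stated
    assumptions. X: node attributes; y_target: 0/1 ground-truth target
    membership; labeled_idx: the few nodes with revealed labels."""
    # 1. Attribute-only classifier on the sampled ground-truth labels.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[labeled_idx], y_target[labeled_idx])
    noisy = clf.predict(X)  # noisy 0/1 labels for every node

    # 2. Weighted graph from the noisy labels (Eq. (4), eps = 0.05).
    adj = build_weighted_graph(edges, noisy, eps=eps)

    # 3. Seeds: labeled nodes known to be in the target cluster.
    seeds = [i for i in labeled_idx if y_target[i] == 1]
    k = int(y_target.sum())  # target-cluster size (assumed known here)

    # 4. Sweep alpha = 2, 2.25, ..., 4 and score supp(x) against the truth.
    best_f1 = 0.0
    for alpha in alphas:
        x = flow_diffusion(adj, seeds, source_mass=alpha * k)
        pred = np.zeros(len(X), dtype=int)
        pred[list(x.keys())] = 1  # supp(x) as the recovered cluster
        best_f1 = max(best_f1, f1_score(y_target, pred))
    return best_f1
```

Averaging `run_trial` over 100 random draws of `labeled_idx` per cluster would correspond to the evaluation protocol quoted above; the FD and WFD baselines differ only in the graph passed to the diffusion step.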