Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
GraphCleaner: Detecting Mislabelled Samples in Popular Graph Learning Benchmarks
Authors: Yuwen Li, Miao Xiong, Bryan Hooi
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on 6 datasets and 6 experimental settings demonstrate that GRAPHCLEANER outperforms the closest baseline, with an average improvement of 0.14 in F1 score, and 0.16 in MCC. |
| Researcher Affiliation | Academia | (1) School of Computing, National University of Singapore, Singapore; (2) Institute of Data Science, National University of Singapore, Singapore. |
| Pseudocode | Yes | Algorithm 1 Synthetic Mislabel Dataset Generation; Algorithm 2 Neighborhood-Aware Mislabel Detector |
| Open Source Code | Yes | Corrected datasets and code are available at https://github.com/lywww/GraphCleaner/tree/master. |
| Open Datasets | Yes | We use 6 datasets, namely, Cora, CiteSeer and PubMed (Yang et al., 2016), Computers and Photo (Shchur et al., 2018), OGB-arxiv (Hu et al., 2020)... We publicly release 2 improved variants of PubMed dataset: PubMedCleaned and PubMedMulti for more accurate evaluation. |
| Dataset Splits | Yes | The node set V is partitioned into training, validation, and test sets, denoted by V_train, V_val, and V_test. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned in the paper. |
| Experiment Setup | Yes | We use three mislabel rates, ϵ = 0.1, 0.05, 0.025, for realistic concern. ... Specifically, ϵ is set as 0.1, 0.05, 0.025... Since the average label error reported in Northcutt et al. (2021b) is 3.4%, we simply set the threshold as 0.97. All our experiments and case studies use this threshold. ... The maximum neighborhood size K determines the range of neighborhood we consider. To investigate the robustness of GRAPHCLEANER to K, we vary K from 1 to 5 with other parameters fixed. |
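
The experiment setup above (mislabel rates ϵ = 0.1, 0.05, 0.025, scored with F1 and MCC) can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: the uniform random flipping used here is an assumption made for illustration, and the helper names `inject_mislabels` and `f1_and_mcc` are hypothetical. The sketch only shows how a chosen mislabel rate produces a ground-truth mask against which a detector's flags could be scored.

```python
import math
import random


def inject_mislabels(labels, num_classes, epsilon, seed=0):
    """Flip a fraction `epsilon` of labels to a different, uniformly chosen class.

    Assumption: uniform flips. The paper's own generation procedure may differ.
    Returns the noisy labels and a boolean mask marking the flipped positions.
    """
    rng = random.Random(seed)
    n_flip = round(epsilon * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Pick any class other than the original one, so every flip is a real error.
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    is_mislabelled = [i in flipped for i in range(len(labels))]
    return noisy, is_mislabelled


def f1_and_mcc(truth, pred):
    """Compute F1 and Matthews correlation coefficient for boolean sequences."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Toy labels (hypothetical data): 200 nodes spread over 3 classes.
toy_labels = [i % 3 for i in range(200)]
for eps in (0.1, 0.05, 0.025):
    noisy_labels, truth_mask = inject_mislabels(toy_labels, 3, eps)
    # A real detector would produce its own flags; here we score a trivial
    # "flag nothing" baseline to show how the metrics are applied.
    flags = [False] * len(toy_labels)
    print(eps, sum(truth_mask), f1_and_mcc(truth_mask, flags))
```

In an actual evaluation, `flags` would come from the detector under test, and the same mask would be reused across methods so that F1 and MCC are comparable.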