Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GraphCleaner: Detecting Mislabelled Samples in Popular Graph Learning Benchmarks

Authors: Yuwen Li, Miao Xiong, Bryan Hooi

ICML 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on 6 datasets and 6 experimental settings demonstrate that GRAPHCLEANER outperforms the closest baseline, with an average improvement of 0.14 in F1 score, and 0.16 in MCC.
Researcher Affiliation	Academia	(1) School of Computing, National University of Singapore, Singapore; (2) Institute of Data Science, National University of Singapore, Singapore.
Pseudocode	Yes	Algorithm 1 Synthetic Mislabel Dataset Generation; Algorithm 2 Neighborhood-Aware Mislabel Detector
Open Source Code	Yes	Corrected datasets and code are available at https://github.com/lywww/GraphCleaner/tree/master.
Open Datasets	Yes	We use 6 datasets, namely, Cora, CiteSeer and PubMed (Yang et al., 2016), Computers and Photo (Shchur et al., 2018), OGB-arxiv (Hu et al., 2020)... We publicly release 2 improved variants of PubMed dataset: PubMedCleaned and PubMedMulti for more accurate evaluation.
Dataset Splits	Yes	The node set V is partitioned into training, validation, and test sets, denoted by V_train, V_val, and V_test.
Hardware Specification	No	No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were mentioned in the paper.
Software Dependencies	No	No specific software dependencies with version numbers were mentioned in the paper.
Experiment Setup	Yes	We use three mislabel rates, ϵ = 0.1, 0.05, 0.025, for realistic concern. ... Specifically, ϵ is set as 0.1, 0.05, 0.025... Since the average label error reported in Northcutt et al. (2021b) is 3.4%, we simply set the threshold as 0.97. All our experiments and case studies use this threshold. ... The maximum neighborhood size K determines the range of neighborhood we consider. To investigate the robustness of GRAPHCLEANER to K, we vary K from 1 to 5 with other parameters fixed.
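
The evaluation protocol summarized in the table (synthetic mislabels at rate ϵ, a fixed threshold of 0.97, and F1 and MCC as metrics) can be illustrated with a minimal sketch. This is not the authors' code. It assumes uniform random label flipping, which is a simplification and not necessarily what the paper's Algorithm 1 does, and it assumes a detector that emits a per-node score in [0, 1] and flags nodes whose score exceeds the threshold; how the paper applies the threshold is not stated in this record. The function names `inject_mislabels`, `flag_by_threshold`, and `f1_and_mcc` are hypothetical.

```python
import math
import random


def inject_mislabels(labels, num_classes, epsilon, seed=0):
    """Flip a fraction `epsilon` of labels to a different class.

    Assumption: flips are uniform at random; the paper's own
    generation procedure may differ.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_flip = round(epsilon * n)
    flipped = set(rng.sample(range(n), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Pick any class other than the original one.
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    is_mislabelled = [i in flipped for i in range(n)]
    return noisy, is_mislabelled


def flag_by_threshold(scores, threshold=0.97):
    """Flag a node as mislabelled when its score exceeds `threshold`.

    Assumption: the score is a mislabel probability in [0, 1].
    """
    return [s > threshold for s in scores]


def f1_and_mcc(y_true, y_pred):
    """F1 and Matthews correlation coefficient for boolean sequences."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


if __name__ == "__main__":
    clean = [i % 3 for i in range(200)]  # toy labels, 3 classes
    for eps in (0.1, 0.05, 0.025):       # mislabel rates from the table
        noisy, truth = inject_mislabels(clean, num_classes=3, epsilon=eps)
        # Stand-in scores: a hypothetical detector that is always right.
        scores = [1.0 if m else 0.0 for m in truth]
        pred = flag_by_threshold(scores)
        print(eps, sum(truth), f1_and_mcc(truth, pred))
```

A real detector would replace the stand-in scores; the injection and metric functions stay the same, which keeps results at different ϵ comparable.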