Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets

Authors: Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods."
Researcher Affiliation | Academia | "(1) Aalto University, Finland; (2) Max Planck Institute for Informatics, Germany; (3) Helmholtz Munich, Germany; (4) TU Munich, Germany; (5) University of Fribourg, Switzerland."
Pseudocode | No | "The paper formally defines concepts such as Mode Perturbation (Definition 2.1), Feature Perturbations (Definition 2.2), Structural Perturbations (Definition 2.3), and Dataset Perturbation (Definition 2.4), among others, using mathematical notation, but it does not present any explicit block labeled 'Pseudocode' or 'Algorithm' with structured, step-by-step instructions."
Open Source Code | Yes | "Our reproducibility package is publicly available via Zenodo at DOI 10.5281/zenodo.15547322. The code is maintained on GitHub at https://github.com/aidos-lab/rings."
Open Datasets | Yes | "We evaluate 13 popular graph-classification datasets: From the life sciences, we select AIDS, ogbg-molhiv (MolHIV), MUTAG, and NCI1 (small molecules), as well as DD, Enzymes, Peptides-func (Peptides), and PROTEINS-full (Proteins) (larger chemical structures). From the social sciences, we take COLLAB, IMDB-B, and IMDB-M (collaboration ego-networks), as well as REDDIT-B and REDDIT-M (online interactions)."
Dataset Splits | Yes | "Tuning Strategy: 5-fold CV, 64 consistent θ for each (φ(D), A)... Evaluation Strategy: Tuned model (φ(D), A, θ) re-trained on distinct CV splits and random seeds, then evaluated on φ(D)'s test set... Test/Train Split Seeds: {67, 23, 77, 88, 54}"
Hardware Specification | Yes | "Available CPUs: Intel Xeon (Haswell, Broadwell, Skylake, Cascade Lake, Sapphire Rapids, Emerald Rapids); Intel Xeon (6134, 6248R, 6142M, 6128, 6136, E5620); Intel Platinum (8280L, 8468, 8562Y+); AMD Opteron (6164 HE, 6234, 6376 (x2), 6272, 6128); AMD EPYC (7742, 7713, 7413, 7262). Available GPUs: NVIDIA Tesla (K80, P100, V100, A100, H100, H200); NVIDIA Quadro (RTX 8000, RTX 6000); AMD MI100"
Software Dependencies | No | "The paper mentions software components such as PyTorch Geometric, GCN, GAT, GIN, and the GPS Transformer, as well as the use of a 'PyTorch global seed'. However, it does not provide specific version numbers for any of these software dependencies."
Experiment Setup | Yes | "Table 5: GNN tuning parameters. For consistency, all (φ(D), A) were tuned over a consistent hyperparameter grid. Activation: ReLU; Batch Size: {64, 128}; Dropout: {0.1, 0.5}; Fold: {0, 1, 2, 3, 4}; Hidden Dim: {128, 256}; Learning Rate (LR): {0.01, 0.001}; Max Epochs: 200; Normalization: Batch; Num Layers: 3; Optimizer: Adam; Readout: Sum; Seed: {0, 42}; Weight Decay: {0.0005, 0.005}"
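The Table 5 grid and the 5-fold CV protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the grid values are taken from Table 5, while the function names (`grid_configs`, `five_fold_splits`) and the index-based split helper are assumptions for demonstration. Note that the six two-valued hyperparameters yield exactly the "64 consistent θ" mentioned under Dataset Splits.

```python
# Sketch of the tuning protocol: a full grid search over the Table 5
# hyperparameters, with deterministic 5-fold CV splits per seed.
import itertools
import random

# Tuned hyperparameters from Table 5 (fixed settings such as
# Activation=ReLU, Max Epochs=200, Num Layers=3 are omitted).
GRID = {
    "batch_size": [64, 128],
    "dropout": [0.1, 0.5],
    "hidden_dim": [128, 256],
    "lr": [0.01, 0.001],
    "seed": [0, 42],
    "weight_decay": [0.0005, 0.005],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def five_fold_splits(n_samples, seed):
    """Deterministic 5-fold CV: shuffle indices with a fixed seed,
    then deal them round-robin into five disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[fold::5] for fold in range(5)]

configs = list(grid_configs(GRID))
print(len(configs))  # 64 configurations per (φ(D), A)
```

The round-robin split is one simple way to obtain disjoint, seed-reproducible folds; the paper's own splitting code (available in the linked repository) may differ in detail.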