Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets

Authors: Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods."
Researcher Affiliation | Academia | "(1) Aalto University, Finland; (2) Max Planck Institute for Informatics, Germany; (3) Helmholtz Munich, Germany; (4) TU Munich, Germany; (5) University of Fribourg, Switzerland."
Pseudocode | No | "The paper formally defines concepts such as Mode Perturbation (Definition 2.1), Feature Perturbations (Definition 2.2), Structural Perturbations (Definition 2.3), and Dataset Perturbation (Definition 2.4), among others, using mathematical notation, but it does not present any explicit block labeled 'Pseudocode' or 'Algorithm' with structured, step-by-step instructions."
Open Source Code | Yes | "Our reproducibility package is publicly available via Zenodo at DOI 10.5281/zenodo.15547322. The code is maintained on GitHub at https://github.com/aidos-lab/rings."
Open Datasets | Yes | "We evaluate 13 popular graph-classification datasets: From the life sciences, we select AIDS, ogbg-molhiv (MolHIV), MUTAG, and NCI1 (small molecules), as well as DD, Enzymes, Peptides-func (Peptides), and PROTEINS-full (Proteins) (larger chemical structures). From the social sciences, we take COLLAB, IMDB-B, and IMDB-M (collaboration ego-networks), as well as REDDIT-B and REDDIT-M (online interactions)."
Dataset Splits | Yes | "Tuning Strategy: 5-fold CV, 64 consistent θ for each (φ(D), A)... Evaluation Strategy: Tuned model (φ(D), A, θ) re-trained on distinct CV splits and random seeds, then evaluated on φ(D)'s test set... Test/Train Split Seeds: {67, 23, 77, 88, 54}"
Hardware Specification | Yes | "Available CPUs: Intel Xeon (Haswell, Broadwell, Skylake, Cascade Lake, Sapphire Rapids, Emerald Rapids); Intel Xeon (6134, 6248R, 6142M, 6128, 6136, E5620); Intel Platinum (8280L, 8468, 8562Y+); AMD Opteron (6164 HE, 6234, 6376 (x2), 6272, 6128); AMD EPYC (7742, 7713, 7413, 7262). Available GPUs: NVIDIA Tesla (K80, P100, V100, A100, H100, H200); NVIDIA Quadro (RTX 8000, RTX 6000); AMD MI100"
Software Dependencies | No | "The paper mentions software components such as PyTorch Geometric, GCN, GAT, GIN, and the GPS Transformer, as well as the use of a 'PyTorch global seed'. However, it does not provide specific version numbers for any of these software dependencies."
Experiment Setup | Yes | "Table 5: GNN tuning parameters. For consistency, all (φ(D), A) were tuned over a consistent hyperparameter grid. Activation: ReLU; Batch Size: {64, 128}; Dropout: {0.1, 0.5}; Fold: {0, 1, 2, 3, 4}; Hidden Dim: {128, 256}; Learning Rate (LR): {0.01, 0.001}; Max Epochs: 200; Normalization: Batch; Num Layers: 3; Optimizer: Adam; Readout: Sum; Seed: {0, 42}; Weight Decay: {0.0005, 0.005}"
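The Table 5 grid and the 5-fold CV protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the grid values are taken from Table 5, while the function names (`grid_configs`, `five_fold_splits`) and the index-based split helper are assumptions for demonstration. Note that the six two-valued hyperparameters yield exactly the "64 consistent θ" mentioned under Dataset Splits.

```python
# Sketch of the tuning protocol: a full grid search over the Table 5
# hyperparameters, with deterministic 5-fold CV splits per seed.
import itertools
import random

# Tuned hyperparameters from Table 5 (fixed settings such as
# Activation=ReLU, Max Epochs=200, Num Layers=3 are omitted).
GRID = {
    "batch_size": [64, 128],
    "dropout": [0.1, 0.5],
    "hidden_dim": [128, 256],
    "lr": [0.01, 0.001],
    "seed": [0, 42],
    "weight_decay": [0.0005, 0.005],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def five_fold_splits(n_samples, seed):
    """Deterministic 5-fold CV: shuffle indices with a fixed seed,
    then deal them round-robin into five disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[fold::5] for fold in range(5)]

configs = list(grid_configs(GRID))
print(len(configs))  # 64 configurations per (φ(D), A)
```

The round-robin split is one simple way to obtain disjoint, seed-reproducible folds; the paper's own splitting code (available in the linked repository) may differ in detail.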