Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets
Authors: Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. |
| Researcher Affiliation | Academia | 1Aalto University, Finland 2Max Planck Institute for Informatics, Germany 3Helmholtz Munich, Germany 4TU Munich, Germany 5University of Fribourg, Switzerland. |
| Pseudocode | No | The paper formally defines concepts like Mode Perturbation (Definition 2.1), Feature Perturbations (Definition 2.2), Structural Perturbations (Definition 2.3), Dataset Perturbation (Definition 2.4), and others using mathematical notation, but it does not present any explicit block labeled 'Pseudocode' or 'Algorithm' with structured, step-by-step instructions like code. |
| Open Source Code | Yes | Our reproducibility package is publicly available via Zenodo at 10.5281/zenodo.15547322. The code is maintained on GitHub at https://github.com/aidos-lab/rings. |
| Open Datasets | Yes | We evaluate 13 popular graph-classification datasets: From the life sciences, we select AIDS, ogbg-molhiv (MolHIV), MUTAG, and NCI1 (small molecules), as well as DD, Enzymes, Peptides-func (Peptides), and PROTEINS-full (Proteins) (larger chemical structures). From the social sciences, we take COLLAB, IMDB-B, and IMDB-M (collaboration ego-networks), as well as REDDIT-B and REDDIT-M (online interactions). |
| Dataset Splits | Yes | Tuning Strategy: 5-fold CV, 64 consistent θ for each (φ(D), A)... Evaluation Strategy: Tuned model (φ(D), A, θ) re-trained on distinct CV splits and random seeds, then evaluated on φ(D)'s test set... Test/Train Split Seeds: {67, 23, 77, 88, 54} |
| Hardware Specification | Yes | Available CPUs: Intel Xeon (Haswell, Broadwell, Skylake, Cascade Lake, Sapphire Rapids, Emerald Rapids); Intel Xeon (6134, 6248R, 6142M, 6128, 6136, E5620); Intel Platinum (8280L, 8468, 8562Y+); AMD Opteron (6164 HE, 6234, 6376 (x2), 6272, 6128); AMD EPYC (7742, 7713, 7413, 7262). Available GPUs: NVIDIA Tesla (K80, P100, V100, A100, H100, H200); NVIDIA Quadro (RTX 8000, RTX 6000); AMD MI100. |
| Software Dependencies | No | The paper mentions software components such as PyTorch Geometric, GCN, GAT, GIN, and the GPS Transformer, as well as the use of a 'PyTorch global seed'. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Table 5: GNN tuning parameters. For consistency, all (φ(D), A) were tuned across a consistent hyperparameter grid. Activation: ReLU; Batch Size: {64, 128}; Dropout: {0.1, 0.5}; Fold: {0, 1, 2, 3, 4}; Hidden Dim: {128, 256}; Learning Rate (LR): {0.01, 0.001}; Max Epochs: 200; Normalization: Batch; Num Layers: 3; Optimizer: Adam; Readout: Sum; Seed: {0, 42}; Weight Decay: {0.0005, 0.005} |
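The grid reported in Table 5 has five varied hyperparameters with two values each, plus two seeds, which yields 2^6 = 64 configurations per (φ(D), A) pair and matches the "64 consistent θ" noted under Dataset Splits. A minimal sketch of enumerating that grid (illustrative only, not the authors' code; variable names are ours):

```python
from itertools import product

# Varied hyperparameters from Table 5. Fixed settings are omitted from the
# product: activation=ReLU, max_epochs=200, num_layers=3, optimizer=Adam,
# readout=sum, normalization=batch.
grid = {
    "batch_size": [64, 128],
    "dropout": [0.1, 0.5],
    "hidden_dim": [128, 256],
    "lr": [0.01, 0.001],
    "weight_decay": [0.0005, 0.005],
    "seed": [0, 42],
}

# Enumerate every configuration theta as a dict of name -> value.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 64 configurations per (dataset, architecture) pair
```

Each of the 64 configurations would then be trained and scored under the paper's 5-fold cross-validation before selecting the best θ.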
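The evaluation protocol re-trains each tuned model on splits derived from the five seeds {67, 23, 77, 88, 54}. A pure-Python sketch of seeded 5-fold index splitting (a hypothetical helper for illustration; the paper does not specify its exact splitting code):

```python
import random

def five_fold_indices(n, seed):
    """Shuffle the indices 0..n-1 with a fixed seed and partition them
    into 5 disjoint folds by striding through the shuffled order."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # deterministic given the seed
    return [idx[i::5] for i in range(5)]

# Split seeds reported under Dataset Splits; n=100 is an arbitrary example size.
SPLIT_SEEDS = [67, 23, 77, 88, 54]
folds_per_seed = {s: five_fold_indices(100, s) for s in SPLIT_SEEDS}
```

Fixing the seeds this way makes the train/test partitions reproducible across re-runs, which is what the "Dataset Splits" row is crediting.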