Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Pruning Spurious Subgraphs for Graph Out-of-Distribution Generalization

Authors: Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric P Xing, Zhiqiang Shen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate the effectiveness of PrunE on both synthetic datasets and real-world datasets, and answer the following research questions. RQ1. How does our method perform compared with SOTA baselines? RQ2. How do the individual components and hyperparameters in PrunE affect the overall performance? RQ3. Can the optimal subgraph selector t (G) correctly identify G_c? RQ4. Do edges in G_c predicted by t(·) exhibit higher probability scores than edges in G_s? RQ5. How does PrunE perform on datasets with concept shift? RQ6. How do different GNN architectures impact the OOD performance? More details on the datasets, experiment setup and experimental results are presented in Appendix J.
Researcher Affiliation	Academia	(1) Mohamed bin Zayed University of Artificial Intelligence; (2) Carnegie Mellon University; (3) The University of Sydney
Pseudocode	Yes	The pseudocode of PrunE is shown in Appendix E. Algorithm 1: The proposed method
Open Source Code	Yes	Codes are available at: https://github.com/tianyao-aka/PrunE-GraphOOD.
Open Datasets	Yes	Datasets. We adopt GOOD datasets [15], OGBG-Molbbbp datasets [18, 60], and DrugOOD datasets [21] to comprehensively evaluate the OOD generalization performance of our proposed framework.
Dataset Splits	Yes	GOOD-Motif [15] dataset with base split for the case study. GOOD-HIV is a molecular dataset derived from the MoleculeNet [60] benchmark... We adopt the covariate shift split... GOOD-Motif [15]... We employ the covariate shift split... OGBG-Molbbbp [18]... we create scaffold shift and graph size shift to evaluate our method. DrugOOD [21] dataset... This benchmark offers three environment-splitting strategies: Assay, Scaffold, and Size. Table 7: Details about the datasets used in our experiments. Columns: dataset, split, # training, # validation, # testing, # classes, metric. GOOD-Motif, Base: 18000, 3000, 3000, 3, ACC; GOOD-Motif, Size: 18000, 3000, 3000, 3, ACC; SPMotif, Correlation: 9000, 3000, 3000, 3, ACC; GOOD-HIV, Scaffold: 24682, 4113, 4108, 2, ROC-AUC; GOOD-HIV, Size: 26169, 4112, 3961, 2, ROC-AUC
Hardware Specification	Yes	We conduct all experiments using PyTorch [43] (v2.1.2) and PyTorch Geometric [12] on Linux servers equipped with NVIDIA RTX4090 GPUs and CUDA 12.1.
Software Dependencies	Yes	We conduct all experiments using PyTorch [43] (v2.1.2) and PyTorch Geometric [12] on Linux servers equipped with NVIDIA RTX4090 GPUs and CUDA 12.1.
Experiment Setup	Yes	Training and Validation. By default, we use Adam optimizer [24] with a learning rate of 1e-3 and a batch size of 64 for all experiments. For DrugOOD, GOOD-Motif and GOOD-HIV datasets, our method is pretrained for 10 epochs with ERM, and for other datasets, we do not use ERM pretraining. We employ an early stopping of 10 epochs according to the validation performance for DrugOOD datasets and GOOD-Motif datasets, and do not employ early stopping for other datasets. Test accuracy or ROC-AUC is obtained according to the best validation performance for all experiments. All experiments are run with 4 different random seeds, the mean and standard deviation are reported using the 4 runs of experiments. Hyperparameter search for PrunE. For PrunE, the edge budget η in L_e is searched over: {0.5, 0.75, 0.85}; K for the K% edges with lowest probability score in L_s is searched over: {50, 70, 90}; λ1, λ2 for balancing L_e and L_s are searched over: {10, 40} and {1e-1, 1e-2, 1e-3} respectively. The encoder of subgraph selector t(·) is searched over {GIN, GCN}, with the number of layers: {2, 3}.
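
For readers who want to reproduce the search described in the Experiment Setup row, the following is a minimal sketch of how the quoted search space could be enumerated and how per-seed results could be summarized. It is not taken from the authors' repository. The dictionary keys and the function names `enumerate_configs` and `summarize` are hypothetical. The excerpt does not say whether the authors ran the full Cartesian grid or whether they report the sample or population standard deviation, so both choices here (full grid, sample standard deviation) are assumptions.

```python
from itertools import product
from statistics import mean, stdev

# Search space as quoted in the Experiment Setup row.
# Key names are hypothetical; only the values come from the excerpt.
SEARCH_SPACE = {
    "edge_budget_eta": [0.5, 0.75, 0.85],
    "lowest_k_percent": [50, 70, 90],
    "lambda_1": [10, 40],
    "lambda_2": [1e-1, 1e-2, 1e-3],
    "selector_encoder": ["GIN", "GCN"],
    "selector_num_layers": [2, 3],
}

# Default training settings quoted in the same row.
DEFAULTS = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 64,
    "num_seeds": 4,
}


def enumerate_configs(space):
    """Yield one dict per point of the full Cartesian grid (an assumption)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


configs = list(enumerate_configs(SEARCH_SPACE))
print(len(configs))  # 3 * 3 * 2 * 3 * 2 * 2 = 216 grid points
```

Under the full-grid assumption, each of the 216 configurations would be trained with 4 seeds, and `summarize` would be applied to the 4 resulting test scores of the configuration selected on validation performance.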