Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Causally-guided Regularization of Graph Attention Improves Generalizability
Authors: Alexander P Wu, Thomas Markovich, Bonnie Berger, Nils Yannick Hammerla, Rohit Singh
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assessed the effectiveness of CAR by comparing the performance of a diverse range of models trained with and without CAR on 8 node classification datasets. Specifically, we aimed to assess the consistency of CAR's outperformance over matching baseline models across various graph attention mechanisms and hyperparameter choices. Accordingly, we evaluated numerous combinations of such configurations (48 settings for each dataset and graph attention mechanism), rather than testing only a limited set of optimized hyperparameter configurations. The configurable model design and hyperparameter choices that we evaluated include the graph attention mechanism (GAT, GATv2, or Graph Transformer), the number of graph attention layers L = {1, 2}, the number of attention heads K = {1, 3, 5}, the number of hidden dimensions F = {10, 25, 100, 200}, and the regularization strength λ = {0.1, 0.5, 1, 5}. In all experiments, we used R = 5 interventions per node during training. See Appendix A.3 for details on the network architecture, hyperparameters, and training configurations. |
| Researcher Affiliation | Collaboration | Alexander P. Wu EMAIL MIT CSAIL Thomas Markovich EMAIL Twitter Cortex Bonnie Berger EMAIL MIT CSAIL and Department of Mathematics Nils Y. Hammerla EMAIL Twitter Cortex Rohit Singh EMAIL MIT CSAIL |
| Pseudocode | Yes | Algorithm 1 CAR Framework |
| Open Source Code | Yes | To ensure the reproducibility of the results in this paper, we have included the source code for our method as supplementary materials. |
| Open Datasets | Yes | We used a total of 8 real-world node classification datasets of varying sizes and degrees of homophily: Cora, CiteSeer, PubMed, ogbn-arxiv, Chameleon, Squirrel, Cornell and Wisconsin. Each model was evaluated according to its accuracy on a held-out test set. The datasets used in this paper are all publicly available, and we also use the publicly available train/validation/test splits for these datasets. We provide details on these datasets in the Appendix and have provided references to them in both the main text and the Appendix. |
| Dataset Splits | Yes | For all datasets, we use the publicly available train/validation/test splits that accompany these datasets. Each dataset was partitioned into training, validation, and test splits in line with previous work (Appendix A.4), and early stopping was applied during training with respect to the validation loss. |
| Hardware Specification | Yes | Training was performed on a single NVIDIA Tesla T4 GPU. |
| Software Dependencies | No | Models were implemented in PyTorch and PyTorch Geometric (Fey & Lenssen, 2019). The paper mentions PyTorch and PyTorch Geometric, but does not specify their version numbers, nor the version of Python or CUDA used. |
| Experiment Setup | Yes | The configurable model design and hyperparameter choices that we evaluated include the graph attention mechanism (GAT, GATv2, or Graph Transformer), the number of graph attention layers L = {1, 2}, the number of attention heads K = {1, 3, 5}, the number of hidden dimensions F = {10, 25, 100, 200}, and the regularization strength λ = {0.1, 0.5, 1, 5}. In all experiments, we used R = 5 interventions per node during training. See Appendix A.3 for details on the network architecture, hyperparameters, and training configurations. We used cross-entropy loss for the prediction loss ℓ_p(·, ·) and binary cross-entropy loss for the causal regularization loss ℓ_c(·, ·). The link function σ(·) was chosen to be the sigmoid function with temperature T = 0.1. Unless otherwise specified, we performed R = 5 rounds of edge interventions per mini-batch when training with CAR. All models were trained using the Adam optimizer with a learning rate of 0.01 and mini-batch size of 10,000. |
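The hyperparameter grid quoted in the Experiment Setup row can be enumerated as a minimal sketch; all names below are illustrative, not taken from the authors' released code. Note that the full Cartesian product of the quoted choices is 96 per mechanism, so the paper's count of 48 settings presumably reflects a restriction not visible in the excerpt.

```python
from itertools import product

# Hyperparameter choices as quoted from the paper (Appendix A.3 has full details).
MECHANISMS = ["GAT", "GATv2", "GraphTransformer"]
NUM_LAYERS = [1, 2]                # L: graph attention layers
NUM_HEADS = [1, 3, 5]              # K: attention heads
HIDDEN_DIMS = [10, 25, 100, 200]   # F: hidden dimensions
REG_STRENGTHS = [0.1, 0.5, 1, 5]   # λ: CAR regularization strength

def grid(mechanism):
    """Enumerate all (L, K, F, λ) settings for one attention mechanism."""
    return [
        {"mechanism": mechanism, "L": L, "K": K, "F": F, "lam": lam}
        for L, K, F, lam in product(NUM_LAYERS, NUM_HEADS, HIDDEN_DIMS, REG_STRENGTHS)
    ]

settings = grid("GAT")
print(len(settings))  # 2 * 3 * 4 * 4 = 96 combinations per mechanism
```

Each setting would be trained twice (with and without CAR, the baseline ignoring λ) and scored by held-out test accuracy, per the evaluation protocol quoted above.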