Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Causally-guided Regularization of Graph Attention Improves Generalizability
Authors: Alexander P Wu, Thomas Markovich, Bonnie Berger, Nils Yannick Hammerla, Rohit Singh
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assessed the effectiveness of CAR by comparing the performance of a diverse range of models trained with and without CAR on 8 node classification datasets. Specifically, we aimed to assess the consistency of CAR's outperformance over matching baseline models across various graph attention mechanisms and hyperparameter choices. Accordingly, we evaluated numerous combinations of such configurations (48 settings for each dataset and graph attention mechanism), rather than testing only a limited set of optimized hyperparameter configurations. The configurable model design and hyperparameter choices that we evaluated include the graph attention mechanism (GAT, GATv2, or Graph Transformer), the number of graph attention layers L = {1, 2}, the number of attention heads K = {1, 3, 5}, the number of hidden dimensions F = {10, 25, 100, 200}, and the regularization strength λ = {0.1, 0.5, 1, 5}. In all experiments, we used R = 5 interventions per node during training. See Appendix A.3 for details on the network architecture, hyperparameters, and training configurations. |
| Researcher Affiliation | Collaboration | Alexander P. Wu EMAIL MIT CSAIL Thomas Markovich EMAIL Twitter Cortex Bonnie Berger EMAIL MIT CSAIL and Department of Mathematics Nils Y. Hammerla EMAIL Twitter Cortex Rohit Singh EMAIL MIT CSAIL |
| Pseudocode | Yes | Algorithm 1 CAR Framework |
| Open Source Code | Yes | To ensure the reproducibility of the results in this paper, we have included the source code for our method as supplementary materials. |
| Open Datasets | Yes | We used a total of 8 real-world node classification datasets of varying sizes and degrees of homophily: Cora, CiteSeer, PubMed, ogbn-arxiv, Chameleon, Squirrel, Cornell and Wisconsin. Each model was evaluated according to its accuracy on a held-out test set. The datasets used in this paper are all publicly available, and we also use the publicly available train/validation/test splits for these datasets. We provide details on these datasets in the Appendix and have provided references to them in both the main text and the Appendix. |
| Dataset Splits | Yes | For all datasets, we use the publicly available train/validation/test splits that accompany these datasets. Each dataset was partitioned into training, validation, and test splits in line with previous work (Appendix A.4), and early stopping was applied during training with respect to the validation loss. |
| Hardware Specification | Yes | Training was performed on a single NVIDIA Tesla T4 GPU. |
| Software Dependencies | No | Models were implemented in PyTorch and PyTorch Geometric (Fey & Lenssen, 2019). The paper mentions PyTorch and PyTorch Geometric, but does not specify their version numbers, nor the version of Python or CUDA used. |
| Experiment Setup | Yes | The configurable model design and hyperparameter choices that we evaluated include the graph attention mechanism (GAT, GATv2, or Graph Transformer), the number of graph attention layers L = {1, 2}, the number of attention heads K = {1, 3, 5}, the number of hidden dimensions F = {10, 25, 100, 200}, and the regularization strength λ = {0.1, 0.5, 1, 5}. In all experiments, we used R = 5 interventions per node during training. See Appendix A.3 for details on the network architecture, hyperparameters, and training configurations. We used cross-entropy loss for the prediction loss ℓ_p(·, ·) and binary cross-entropy loss for the causal regularization loss ℓ_c(·, ·). The link function σ(·) was chosen to be the sigmoid function with temperature T = 0.1. Unless otherwise specified, we performed R = 5 rounds of edge interventions per mini-batch when training with CAR. All models were trained using the Adam optimizer with a learning rate of 0.01 and mini-batch size of 10,000. |
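The hyperparameter grid quoted in the Experiment Setup row can be enumerated as a minimal sketch; all names below are illustrative, not taken from the authors' released code. Note that the full Cartesian product of the quoted choices is 96 per mechanism, so the paper's count of 48 settings presumably reflects a restriction not visible in the excerpt.

```python
from itertools import product

# Hyperparameter choices as quoted from the paper (Appendix A.3 has full details).
MECHANISMS = ["GAT", "GATv2", "GraphTransformer"]
NUM_LAYERS = [1, 2]                # L: graph attention layers
NUM_HEADS = [1, 3, 5]              # K: attention heads
HIDDEN_DIMS = [10, 25, 100, 200]   # F: hidden dimensions
REG_STRENGTHS = [0.1, 0.5, 1, 5]   # λ: CAR regularization strength

def grid(mechanism):
    """Enumerate all (L, K, F, λ) settings for one attention mechanism."""
    return [
        {"mechanism": mechanism, "L": L, "K": K, "F": F, "lam": lam}
        for L, K, F, lam in product(NUM_LAYERS, NUM_HEADS, HIDDEN_DIMS, REG_STRENGTHS)
    ]

settings = grid("GAT")
print(len(settings))  # 2 * 3 * 4 * 4 = 96 combinations per mechanism
```

Each setting would be trained twice (with and without CAR, the baseline ignoring λ) and scored by held-out test accuracy, per the evaluation protocol quoted above.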