Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Graph Inverse Style Transfer for Counterfactual Explainability

Authors: Bardh Prenkaj, Efstratios Zaradoukas, Gjergji Kasneci

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on 8 benchmark datasets spanning synthetic and real-world graphs with binary and multiclass classification tasks emphasizes GIST as consistently outperforming SoTA. Specifically, GIST achieves considerably higher validity (+7.6% over the second-best) and improves fidelity by a large margin (+45.5%). Our results highlight GIST's ability to generate counterfactuals that are both more faithful and spectrally aligned (preserved semantics) with the input.
Researcher Affiliation	Academia	¹Technical University of Munich, Germany; ²Sapienza University of Rome, Italy. Correspondence to: Bardh Prenkaj <EMAIL>.
Pseudocode	Yes	Algorithm 1 Forward learning pass of GIST
Open Source Code	Yes	Code: https://github.com/bardhprenkaj/gist
Open Datasets	Yes	Extensive experiments on 8 benchmark datasets spanning synthetic and real-world graphs with binary and multiclass classification tasks emphasizes GIST as consistently outperforming SoTA. Specifically, GIST achieves considerably higher validity (+7.6% over the second-best) and improves fidelity by a large margin (+45.5%). Our results highlight GIST's ability to generate counterfactuals that are both more faithful and spectrally aligned (preserved semantics) with the input.
Dataset Splits	Yes	We use a 90:10 train-test split for all explainers and designate 10% of the training set as validation. We perform 5-fold cross validations to assess the performances of the explainers on one AMD EPYC 7002/3 64-Core CPU (for smaller models) and one Nvidia TESLA V100 (for larger models) totaling 450h of execution time.
Hardware Specification	Yes	We perform 5-fold cross validations to assess the performances of the explainers on one AMD EPYC 7002/3 64-Core CPU (for smaller models) and one Nvidia TESLA V100 (for larger models) totaling 450h of execution time.
Software Dependencies	No	The paper mentions using 'Adam optimizer' and 'RMS Propagation optimizer' with specific learning rates but does not provide specific version numbers for a comprehensive software stack (e.g., Python, PyTorch, CUDA, or other libraries).
Experiment Setup	Yes	For GIST we configured it to run the backtracking process for 50 epochs with a batch size of 16. We chose the number of attention heads to be equal to 2, the node embedding dimension to 16. We set α = 0.9 to encourage higher validity, which is beneficial for a helpful counterfactual. We train GIST with Adam optimizer with learning rate 10⁻³ and a weight decay of 10⁻⁵. For CF² (Tan et al., 2022), we configured: 20 epochs, batch size ratio of 0.2, learning rate (lr) initialized at 0.02, and regularization parameters α = 0.7, λ = 20, and γ = 0.9. CF-GNNExp (Lucic et al., 2022) utilized: α = 0.01, K = 5, β = 0.6, and γ = 0.2. CLEAR (Ma et al., 2022) employed: 10 epochs, learning rate (lr) of 0.01, counterfactual loss regularization parameter (λcfe) set to 0.1, trade-off parameter α = 0.4, and batch size 32. RSGG-CE (Prado-Romero et al., 2024b) was trained for 500 epochs with a GAN configuration: batch size 1 and TopKPooling discriminator. Concerning the oracle implementation, we used the following hyperparameters: 50 epochs, batch size 32, and early stopping threshold 10⁻⁴. We trained the model using the RMS Propagation optimizer (learning rate lr = 0.01) with Cross Entropy loss. The architecture consisted of a Graph Convolutional Neural Network with 3 convolutional layers and 1 dense layer, convolutional booster 2, and linear decay factor 1.8.
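
The Dataset Splits row describes a 90:10 train-test split with 10% of the training portion held out for validation. A minimal sketch of that arithmetic follows, using only the Python standard library. The function name `split_indices`, the shuffling, the seed, and the rounding rule are assumptions made for illustration; the quoted text does not say how instances are ordered, whether the split is stratified, or how the split interacts with the 5-fold cross-validation, so none of that is modeled here.

```python
import random


def split_indices(n, test_frac=0.10, val_frac=0.10, seed=0):
    """Return (train, val, test) index lists for n instances.

    Hypothetical helper: test_frac of all instances go to the test set,
    then val_frac of the REMAINING training portion go to validation.
    Shuffling and the seed are assumptions, not taken from the paper.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)

    n_test = round(n * test_frac)          # 10% of everything
    test, rest = idx[:n_test], idx[n_test:]

    n_val = round(len(rest) * val_frac)    # 10% of the 90% training portion
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 810 90 100
```

Under this reading, the effective proportions are roughly 81% training, 9% validation, and 10% test of the full dataset.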