Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Graph Inverse Style Transfer for Counterfactual Explainability
Authors: Bardh Prenkaj, Efstratios Zaradoukas, Gjergji Kasneci
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 8 benchmark datasets spanning synthetic and real-world graphs with binary and multiclass classification tasks emphasizes GIST as consistently outperforming SoTA. Specifically, GIST achieves considerably higher validity (+7.6% over the second-best) and improves fidelity by a large margin (+45.5%). Our results highlight GIST's ability to generate counterfactuals that are both more faithful and spectrally aligned (preserved semantics) with the input. |
| Researcher Affiliation | Academia | ¹Technical University of Munich, Germany; ²Sapienza University of Rome, Italy. Correspondence to: Bardh Prenkaj <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Forward learning pass of GIST |
| Open Source Code | Yes | Code: https://github.com/bardhprenkaj/gist |
| Open Datasets | Yes | Extensive experiments on 8 benchmark datasets spanning synthetic and real-world graphs with binary and multiclass classification tasks emphasizes GIST as consistently outperforming SoTA. Specifically, GIST achieves considerably higher validity (+7.6% over the second-best) and improves fidelity by a large margin (+45.5%). Our results highlight GIST's ability to generate counterfactuals that are both more faithful and spectrally aligned (preserved semantics) with the input. |
| Dataset Splits | Yes | We use a 90:10 train-test split for all explainers and designate 10% of the training set as validation. We perform 5-fold cross validations to assess the performances of the explainers on one AMD EPYC 7002/3 64-Core CPU (for smaller models) and one Nvidia TESLA V100 (for larger models) totaling 450h of execution time. |
| Hardware Specification | Yes | We perform 5-fold cross validations to assess the performances of the explainers on one AMD EPYC 7002/3 64-Core CPU (for smaller models) and one Nvidia TESLA V100 (for larger models) totaling 450h of execution time. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'RMS Propagation optimizer' with specific learning rates but does not provide specific version numbers for a comprehensive software stack (e.g., Python, PyTorch, CUDA, or other libraries). |
| Experiment Setup | Yes | For GIST we configured it to run the backtracking process for 50 epochs with a batch size of 16. We chose the number of attention heads to be equal to 2, the node embedding dimension to 16. We set α = 0.9 to encourage higher validity, which is beneficial for a helpful counterfactual. We train GIST with Adam optimizer with learning rate 10⁻³ and a weight decay of 10⁻⁵. For CF2 (Tan et al., 2022), we configured: 20 epochs, batch size ratio of 0.2, learning rate (lr) initialized at 0.02, and regularization parameters α = 0.7, λ = 20, and γ = 0.9. CF-GNNExp (Lucic et al., 2022) utilized: α = 0.01, K = 5, β = 0.6, and γ = 0.2. CLEAR (Ma et al., 2022) employed: 10 epochs, learning rate (lr) of 0.01, counterfactual loss regularization parameter (λcfe) set to 0.1, trade-off parameter α = 0.4, and batch size 32. RSGG-CE (Prado-Romero et al., 2024b) was trained for 500 epochs with a GAN configuration: batch size 1 and TopKPooling discriminator. Concerning the oracle implementation, we used the following hyperparameters: 50 epochs, batch size 32, and early stopping threshold 10⁻⁴. We trained the model using the RMS Propagation optimizer (learning rate lr = 0.01) with Cross Entropy loss. The architecture consisted of a Graph Convolutional Neural Network with 3 convolutional layers and 1 dense layer, convolutional booster 2, and linear decay factor 1.8. |
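
The Dataset Splits row quotes a 90:10 train-test split with 10% of the training set held out as validation. The sketch below shows one way to read that arithmetic. It is not taken from the authors' repository: the function name `split_indices`, the shuffling, and the seed are assumptions made for illustration, and the 5-fold cross-validation mentioned in the same row is not modelled here.

```python
import random


def split_indices(n, test_frac=0.10, val_frac=0.10, seed=0):
    """Return (train, val, test) index lists for n samples.

    Assumption: the validation set is carved out of the 90% training
    portion, as the quoted text describes. Shuffling with a fixed seed
    is an illustrative choice, not something the source specifies.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)

    # Hold out the test portion first (10% of all samples).
    n_test = round(n * test_frac)
    test, train_full = idx[:n_test], idx[n_test:]

    # Then take 10% of the remaining training portion as validation.
    n_val = round(len(train_full) * val_frac)
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 810 90 100
```

Under this reading, a dataset of 1000 graphs yields 810 training, 90 validation, and 100 test instances, so the effective proportions are 81:9:10.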