On the Adversarial Robustness of Causal Algorithmic Recourse

Authors: Ricardo Dominguez-Olmedo, Amir H Karimi, Bernhard Schölkopf

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate their effectiveness on five tabular datasets, for linear and neural network classifiers." "We present the experimental results in Figure 3." "We present the experimental results in Figure 4." "We empirically evaluate whether training the decision-making classifier with the proposed ALLR regularizer facilitates the existence of adversarially robust recourse."
Researcher Affiliation | Academia | Max Planck Institute for Intelligent Systems, Tübingen, Germany; University of Tübingen, Germany; ETH Zürich, Switzerland.
Pseudocode | Yes | "Algorithm 1: Generate adversarially robust recourse for a differentiable classifier h and differentiable SCM M." (A hedged sketch of such a procedure appears after this table.)
Open Source Code | Yes | "We open source our implementations and experiments": github.com/RicardoDominguez/AdversariallyRobustRecourse
Open Datasets | Yes | "We consider four real-world datasets and one semi-synthetic dataset." For the causal recourse setting, the paper uses the COMPAS recidivism dataset (Larson et al., 2016) and the Adult demographic dataset (Kohavi & Becker, 1996), adopting the causal graphs assumed in Nabi & Shpitser (2018), plus one semi-synthetic SCM introduced by Karimi et al. (2020) inspired by a loan approval setting. For the non-causal recourse setting, it uses the South German Credit dataset (Groemping, 2019) and a recidivism dataset from North Carolina (Schmidt & Witte, 1988), referred to as Bail.
Dataset Splits | No | The paper reports an "80%-20% train-test split" and tunes the number of epochs for "best predictive performance", implying some internal validation, but the main text does not specify a separate validation split percentage or sample counts. (An illustrative split is sketched after this table.)
Hardware Specification | No | The paper does not report the hardware used to run the experiments, such as GPU/CPU models, memory, or cloud computing instance types.
Software Dependencies | No | The paper mentions "Adam (Kingma & Ba, 2015) as the optimizer" but does not specify version numbers for the software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn) that would be needed for replication.
Experiment Setup | Yes | "We use Adam (Kingma & Ba, 2015) as the optimizer with a learning rate of 10^-3 and a batch size of 100. To determine a suitable number of training epochs for each dataset and training objective, we train for 500 epochs and select the number of training epochs which leads to the best predictive performance in terms of accuracy and Matthews correlation coefficient (MCC). For ALLR with NN classifiers, we heuristically find that µ1 = 3.0 works well across all datasets. We additionally perform hyperparameter search over µ2 ∈ {0.01, 0.1, 0.5, 3.0}." (See the training-loop sketch after this table.)
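Only the name of Algorithm 1 is quoted above, so the following is a minimal sketch of what gradient-based generation of adversarially robust recourse could look like, assuming a differentiable classifier h that returns a scalar logit and a differentiable counterfactual map scm_cf induced by the SCM M. The helper name scm_cf, the additive-intervention parameterization, the L1 action cost, and the trade-off weight lmbda are illustrative assumptions, not the authors' exact method.

```python
import torch

def robust_recourse(x, h, scm_cf, eps=0.1, outer_steps=300, inner_steps=10, lr=0.05, lmbda=2.0):
    """Illustrative sketch (not the paper's exact Algorithm 1): find a low-cost
    action a whose SCM counterfactual is positively classified even under the
    worst-case perturbation delta of the factual instance x, with |delta| <= eps."""
    a = torch.zeros_like(x, requires_grad=True)   # action, here an additive intervention
    opt = torch.optim.Adam([a], lr=lr)
    alpha = 2.5 * eps / inner_steps               # PGD step size for the inner adversary
    for _ in range(outer_steps):
        # Inner maximization: worst-case perturbation of the factual instance
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(inner_steps):
            score = h(scm_cf(x + delta, a.detach()))
            grad, = torch.autograd.grad(score, delta)
            # Descend the classifier score, then project back onto the eps-ball
            delta = (delta - alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
        # Outer minimization: cheapest action accepted in the worst case
        worst_logit = h(scm_cf(x + delta.detach(), a))
        loss = a.abs().sum() + lmbda * torch.relu(-worst_logit)  # hinge at decision threshold 0
        opt.zero_grad()
        loss.backward()
        opt.step()
    return a.detach()
```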
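The reported 80%-20% train-test split can be reproduced along these lines; the random seed, stratification, and placeholder data are assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 7), np.random.randint(0, 2, 1000)  # placeholder data, not a paper dataset
# 80%-20% train-test split as reported; random_state and stratify are assumptions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
```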
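Finally, a minimal training loop consistent with the reported setup: Adam with learning rate 10^-3, batch size 100, up to 500 epochs, and epoch selection by accuracy and MCC. Summing the two selection criteria is an assumption; the paper does not state how they are combined.

```python
import torch
from sklearn.metrics import matthews_corrcoef

def train_classifier(model, train_loader, X_val, y_val, epochs=500, lr=1e-3):
    """Adam with lr 1e-3 and up to 500 epochs, as reported; train_loader is assumed
    to yield batches of 100. Keeps the epoch with the best accuracy + MCC
    (the additive combination is an assumption, not stated in the paper)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    best_score, best_state = -float("inf"), None
    for _ in range(epochs):
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb).squeeze(-1), yb.float()).backward()
            opt.step()
        with torch.no_grad():
            preds = (model(X_val).squeeze(-1) > 0).long()  # logit threshold at 0
        acc = (preds == y_val).float().mean().item()
        mcc = matthews_corrcoef(y_val.numpy(), preds.numpy())
        if acc + mcc > best_score:
            best_score = acc + mcc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```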