Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Performative Validity of Recourse Explanations
Authors: Gunnar König, Hidde Fokkema, Timo Freiesleben, Celestine Mendler-Dünner, Ulrike von Luxburg
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory suggests that the performative validity of recourse critically depends on the recourse method and the functional form of the underlying causal relationships. To investigate this empirically, we study the performative effects of CE, and the different versions of CR and ICR in synthetic and real-world settings. Specifically, we study the following classes of structural equations: (i) Additive noise (LAdd): f_j(x, u) = l(x) + u where l is linear (ii) Multiplicative noise (LMult): f_j(x, u) = m(x)u where m is linear in each dimension (iii) Nonlinear relations and additive noise (NLAdd): f_j(x, u) = g(x) + u where g is nonlinear (iv) Nonlinear relations and multiplicative noise (NLMult): f_j(x, u) = g(x)u where g is nonlinear (v) Polynomial noise (LCubic): f_j(x, u) = (l(x) + u)^3. We note that the first two settings (LAdd and LMult) satisfy Assumption 5.6, but the remaining settings do not. To enable pointwise comparisons of the conditional distributions, we rely on discrete noise distributions with finite support. In addition, we include two real-world settings: College admission (GPA) and credit scoring (Credit). For GPA we assume the causal graph in [Harris et al., 2022] and fit a linear SCM with additive Gaussian noise on the dataset [OpenIntro, 2020]; For Credit we rely on the graph by Chen et al. [2023] and fit a random-forest based, nonlinear SCM with additive Gaussian noise [Yeh, 2009]. The Cred setting has eleven features; all others include one cause and one effect variable. To highlight the differences between methods, we choose the costs such that interventions on effects are more lucrative. Consequently, CE and CR intervene on the effect, while ICR only intervenes on the cause. To allow both sources of invalidity to come into effect, we always model the noise to stay the same post-recourse. We provide a detailed description of setup and results in Appendix F. Figure 3: Experimental results. (Q1, left): The pointwise differences between pre- and post-recourse conditional distribution aggregated using the mean; the lines indicate the range. All values are averages over 10 runs. (Q2, right): The difference in acceptance rate (refit minus original), average and standard deviation (lines) over 10 runs. While CE and CR lead to unfavorable shifts and performative invalidity, ICR is performatively valid in all settings. |
| Researcher Affiliation | Academia | Gunnar König1,2, Hidde Fokkema3, Timo Freiesleben4,5, Celestine Mendler-Dünner1,6, Ulrike von Luxburg1,2 1Tübingen AI Center, 2University of Tübingen 3Korteweg-de Vries Institute for Mathematics, University of Amsterdam 4LMU Munich, 5Munich Center for Machine Learning (MCML) 6ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen |
| Pseudocode | No | The paper describes the recourse methods (CE, CR, ICR) conceptually with mathematical definitions but does not provide any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is publicly available via GitHub: https://github.com/gcskoenig/performative-recourse-experiments |
| Open Datasets | Yes | In addition, we include two real-world settings: College admission (GPA) and credit scoring (Credit). For GPA we assume the causal graph in [Harris et al., 2022] and fit a linear SCM with additive Gaussian noise on the dataset [OpenIntro, 2020]; For Credit we rely on the graph by Chen et al. [2023] and fit a random-forest based, nonlinear SCM with additive Gaussian noise [Yeh, 2009]. The Cred setting has eleven features; all others include one cause and one effect variable. ... OpenIntro. Sat and gpa data set, 2020. URL https://www.openintro.org/data/index.php?data=satgpa. ... I-Cheng Yeh. Default of Credit Card Clients. UCI Machine Learning Repository, 2009. |
| Dataset Splits | Yes | To obtain the updated model and evaluate the impact of the update on acceptance rates, we split the sample of recourse implementing individuals in half. The first half is used to fit the updated model, the second half to evaluate the respective acceptance rate. ... As a result, the updated model is fitted on one third post-recourse samples and two thirds pre-recourse samples. |
| Hardware Specification | Yes | We ran the experiments on a MacBook Pro with M3 Pro Chip and a cluster with Intel Xeon Gold processors with 16 cores and 2.9GHz. |
| Software Dependencies | No | We use sklearn to fit the model. ... we employ evolutionary algorithms (as proposed by Dandl et al. [2020]) and rely on the python package deap [Fortin et al., 2012]. The paper mentions software such as 'sklearn', 'python', and 'deap' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Decision model: We use a decision tree with default hyperparameters for the discrete settings and a logistic regression with default hyperparameters in the real-world setting. We use sklearn to fit the model. ... Optimization: To solve the optimization problems imposed by each of the recourse methods, we employ evolutionary algorithms (as proposed by Dandl et al. [2020]) and rely on the python package deap [Fortin et al., 2012]. ... We chose the population size 25, the number of generations 25, the crossing probability 0.5, and the mutation probability 0.5. ... To select the individuals that make it to the next generation, we rely on the following fitness function: cost(x, a) + λ(t_r − p_success(x, a)) where p_success is the probability of a favorable outcome for the given recourse methods (as defined in Section 3 and Appendix B), and λ = 10^4. |
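
The five synthetic structural-equation classes quoted in the Research Type row can be illustrated with a minimal Python sketch. The concrete functions `l` and `g`, the noise support, and the helper `sample_effect` are hypothetical choices made here for illustration; the quoted text only fixes the functional forms, and with a single cause the map `m` in LMult is taken to be the same linear function as `l`.

```python
import random

# Hypothetical functions for illustration; the report does not state
# the concrete l, m, or g used in the paper's experiments.
def l(x):
    """A linear function of the cause."""
    return 2.0 * x

def g(x):
    """A nonlinear function of the cause."""
    return x ** 2 + 1.0

# One structural equation f_j(x, u) per setting named in the table.
STRUCTURAL_EQUATIONS = {
    "LAdd": lambda x, u: l(x) + u,            # additive noise
    "LMult": lambda x, u: l(x) * u,           # multiplicative noise
    "NLAdd": lambda x, u: g(x) + u,           # nonlinear, additive noise
    "NLMult": lambda x, u: g(x) * u,          # nonlinear, multiplicative noise
    "LCubic": lambda x, u: (l(x) + u) ** 3,   # polynomial noise
}

# Discrete noise with finite support; these values are an assumption.
NOISE_SUPPORT = [1, 2, 3]

def sample_effect(setting, x, rng):
    """Draw noise u from the finite support and return (effect, u)."""
    u = rng.choice(NOISE_SUPPORT)
    return STRUCTURAL_EQUATIONS[setting](x, u), u

rng = random.Random(0)
for name in STRUCTURAL_EQUATIONS:
    effect, u = sample_effect(name, 1.0, rng)
    print(f"{name}: u={u}, effect={effect}")
```

Keeping the noise on a small finite support mirrors the stated reason for discrete noise: conditional distributions can then be compared pointwise.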
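
The split described in the Dataset Splits row can be sketched as follows. This is an illustration, not the authors' code: the function name `split_for_refit` and the tagged tuples are hypothetical, and the sketch assumes equally many pre-recourse and post-recourse samples, which is one way to arrive at the stated one third to two thirds mix.

```python
def split_for_refit(pre_recourse, post_recourse):
    """Split post-recourse samples in half: the first half joins the
    pre-recourse data to fit the updated model, the second half is
    held out to evaluate the acceptance rate."""
    half = len(post_recourse) // 2
    fit_post = post_recourse[:half]
    eval_post = post_recourse[half:]
    fit_set = list(pre_recourse) + list(fit_post)
    return fit_set, eval_post

# Assumed sample sizes, chosen only to make the proportions visible.
pre = [("pre", i) for i in range(100)]
post = [("post", i) for i in range(100)]

fit_set, eval_set = split_for_refit(pre, post)
post_share = sum(1 for tag, _ in fit_set if tag == "post") / len(fit_set)
print(f"fit size={len(fit_set)}, eval size={len(eval_set)}, post share={post_share:.3f}")
```

Under the equal-size assumption the fit set holds 100 pre-recourse and 50 post-recourse samples, matching the proportions quoted in the table.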
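
The fitness function in the Experiment Setup row can be written out as a small Python sketch. It rests on two assumptions that the extracted text does not settle: the penalty is read as the shortfall `t_r - p_success` (the operator is not legible in the extraction), and the weight is read as λ = 10^4. The candidate values are invented for illustration, and the sketch does not use deap; it only shows how the score would rank candidate actions when lower is better.

```python
# Settings quoted in the table; used here only as named constants.
POPULATION_SIZE = 25
N_GENERATIONS = 25
CROSSOVER_PROB = 0.5
MUTATION_PROB = 0.5

# Penalty weight; reading the extracted "104" as 10^4 is an assumption.
LAMBDA = 10 ** 4

def fitness(cost, p_success, t_r, lam=LAMBDA):
    """Cost plus a weighted shortfall of the success probability below
    the target t_r (assumed reading of the garbled formula)."""
    return cost + lam * (t_r - p_success)

# Hypothetical candidate actions: (cost, probability of favorable outcome).
candidates = {
    "cheap_but_unlikely": (1.0, 0.50),
    "costly_but_reliable": (3.0, 0.95),
}
t_r = 0.95  # hypothetical target success probability

scores = {name: fitness(c, p, t_r) for name, (c, p) in candidates.items()}
best = min(scores, key=scores.get)  # lower fitness is better in this sketch
print(scores, best)
```

With a weight this large, any shortfall in success probability dominates the cost term, so the search effectively prefers the cheapest action among those that reach the target.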