Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

The Utility of “Even if” Semifactual Explanation to Optimise Positive Outcomes

Authors: Eoin Kenny, Weipeng Huang

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Tests on benchmark datasets show our algorithms are better at maximising gain compared to prior work, and that causality is important in the process. Most importantly however, a user study supports our main hypothesis by showing people find semifactual explanations more useful than counterfactuals when they receive the positive outcome of a loan acceptance. 5 Experiments & Results: Here we test S-GEN in both causal and non-causal settings. We show the effectiveness of our method in optimising a user's positive outcome compared to baselines and open source our code (see Appendix E).
Researcher Affiliation	Collaboration	Eoin M. Kenny Massachusetts Institute of Technology Cambridge, MA, U.S.A. EMAIL Weipeng Huang Tencent Security Big Data Lab Shenzhen, Guangdong Province, China EMAIL
Pseudocode	Yes	D Algorithm Pseudocode Algorithm 1 S-GEN: Genetic Algorithm to Generate semifactual Recourse with Robustness and Diversity in a Non-Causal Model Agnostic Setting
Open Source Code	Yes	Code available at: https://github.com/Eoin Kenny/Semifactual_Recourse_Generation
Open Datasets	Yes	In the non-causal setting, we consider three datasets, Loan Application [33], German Credit [21], and BCSC [11]. In the causal setting, the Adult [31] and COMPAS [5] datasets are considered.
Dataset Splits	No	The paper describes using '30 random test data point explanation samples' and running '30 averaged samples from 5 random seeds' for evaluation. While a test set is implied, explicit percentages or counts for training, validation, and testing splits, or reference to a standard split with a citation, are not provided for the models' training.
Hardware Specification	Yes	All tests were run on a MacBook Pro, Apple M1 Pro, 16 GB.
Software Dependencies	No	The paper does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). While code is provided, the paper text itself does not detail these dependencies.
Experiment Setup	Yes	C Hyperparameter Choices: In this section, we discuss the hyperparameter specifications for the causal and non-causal cases respectively. Table 1: Hyperparameter Specifications. The number of generations spent searching for a solution was 20. The population size was fixed at {12, 24, 48, 72, 96, 120}, for diversity sizes of {1, 2, 4, 6, 8, 10}, respectively. The mutation rate was 0.05. The number of "elite" solutions passed on for each generation was 4. The probability of a crossover happening was 0.5. The number of Monte Carlo trials for each instance was 100.
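
For readers who want to reuse the reported experiment setup, the values quoted in the last row can be gathered into one configuration object. The sketch below is illustrative only: GAConfig, config_for, POPULATION_BY_DIVERSITY, and all field names are hypothetical and are not taken from the authors' repository. The only numbers used are those quoted above; note that every reported population size equals 12 times its diversity size.

```python
from dataclasses import dataclass

# Population size reported for each diversity size (values quoted in the
# Experiment Setup row). The dictionary name is a hypothetical choice.
POPULATION_BY_DIVERSITY = {1: 12, 2: 24, 4: 48, 6: 72, 8: 96, 10: 120}


@dataclass(frozen=True)
class GAConfig:
    """Hypothetical container for the reported genetic-algorithm settings.

    Field names are illustrative assumptions, not the authors' identifiers.
    """
    diversity_size: int
    population_size: int
    generations: int = 20          # generations spent searching
    mutation_rate: float = 0.05    # mutation rate
    elite_count: int = 4           # "elite" solutions kept per generation
    crossover_prob: float = 0.5    # probability of a crossover
    monte_carlo_trials: int = 100  # Monte Carlo trials per instance


def config_for(diversity_size: int) -> GAConfig:
    """Build a config for one of the reported diversity sizes."""
    if diversity_size not in POPULATION_BY_DIVERSITY:
        raise ValueError(f"no reported population size for diversity {diversity_size}")
    return GAConfig(
        diversity_size=diversity_size,
        population_size=POPULATION_BY_DIVERSITY[diversity_size],
    )
```

Calling config_for(4) yields a configuration with a population of 48; a diversity size outside the reported set raises ValueError rather than extrapolating a value the paper does not state.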