Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Simple Steps to Success: A Method for Step-Based Counterfactual Explanations
Authors: Jenny Hamer, Nicholas Perello, Jason Valladares, Vignesh Viswanathan, Yair Zick
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, via a thorough empirical and theoretical investigation, we show that StEP offers provable robustness and privacy guarantees while outperforming popular methods along important metrics. We also provide an extensive experimental evaluation of StEP, including a holistic cross-comparison with three popular recourse methods (DiCE (Mothilal et al., 2020a), FACE (Poyiadzi et al., 2020), and C-CHVAE (Pawelczyk et al., 2020)) on three widely-used financial datasets: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult (Kohavi, 1996). We also investigate StEP's robustness to noise (Section 4.3). ... Section 4: Empirical Evaluation & Analysis |
| Researcher Affiliation | Collaboration | Jenny Hamer (EMAIL), Google DeepMind, New York; Nicholas Perello (EMAIL), University of Massachusetts, Amherst; Jason Valladares (EMAIL), Google; Vignesh Viswanathan (EMAIL), University of Massachusetts, Amherst; Yair Zick (EMAIL), University of Massachusetts, Amherst |
| Pseudocode | Yes | Algorithm 1 Stepwise Explainable Paths (StEP). Require: dataset X partitioned into k clusters {X₁, ..., X_k}, point of interest x, model f, a nonnegatively valued function α : ℝ≥0 → ℝ≥0. 1: while f(x) ≠ 1 do 2: for every cluster c ∈ [k] do 3: Generate a direction d_c ← Σ_{x′ ∈ X_c} (x′ − x) · α(‖x′ − x‖) · 𝟙(f(x′) = 1) 4: end for 5: Offer the directions {d_c}_{c ∈ [k]} to the stakeholder 6: Stakeholder returns an updated point of interest x 7: end while |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the StEP methodology described. It only mentions using and adapting implementations for baseline methods. |
| Open Datasets | Yes | We provide an extensive experimental evaluation of StEP, including a holistic cross-comparison with three popular recourse methods (DiCE (Mothilal et al., 2020a), FACE (Poyiadzi et al., 2020), and C-CHVAE (Pawelczyk et al., 2020)) on three widely-used financial datasets: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult (Kohavi, 1996). ... Datasets and Models: We employ three real-world datasets in our cross-comparison analysis: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult/Census Income (Kohavi, 1996), described in Table 1. |
| Dataset Splits | Yes | For each dataset, we train and validate logistic regression, random forest, and two-layer DNN model instances following a 70/15/15 training, validation, and test (recourse-time) data split. ... Credit Card Default: we produce random train, validation, and test sets from the 30,000 instances using a 70/15/15 split, resulting in sets with approximately 21k/4.5k/4.5k datapoints respectively. |
| Hardware Specification | Yes | We used two machines for our empirical evaluation, including for base model training (i.e. logistic regression, random forest, neural network), for all recourse experiments, and for post-processing of results to produce metrics. Machine 1: 8 CPU cores, 64 GB RAM, NVIDIA 4070 GPU, and 2 TB local SSD disk. Machine 2: 16 CPU cores, 64 GB RAM, NVIDIA 4080S GPU, and 4 TB local SSD disk. |
| Software Dependencies | No | The paper mentions software like 'scikit-learn' and the 'PyTorch library' but does not specify version numbers for these or other key software components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We specify a confidence threshold of 0.7 at test time for each base model to determine whether a PoI is positively classified. Implementation and hyperparameter tuning details are described in Appendix C.2. ... In our experiments, we use the volcano function α_v with d = 2 and γ = 0.5. ... We perform a basic grid search across base models, recourse methods, and datasets using scikit-learn's GridSearchCV function. To determine appropriate hyperparameters to use across the base ML models and recourse methods, we roughly optimize for a reasonably high success rate and to minimize distance. ... We vary k ∈ {1, 2, 3, 4, 5}, the number of paths to produce for each PoI (and for StEP, the number of clusters to produce), and fix k = 3 for our comparative analysis and user-interference experiments. For StEP, we consider step sizes in {0.10, 0.25, 0.50, 0.75, 1.00} and fix a value of 1 across all experiments. ... For C-CHVAE we set the step distance hyperparameter to 1. ... For all applicable recourse methods, we allow a maximum of 50 iterations to produce k counterfactual(s). |
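The direction-generation step quoted in the Pseudocode row (one round of Algorithm 1) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name `step_directions` is ours, and `alpha` is a generic nonnegative distance-weighting stand-in rather than the paper's exact volcano function α_v.

```python
import numpy as np

def step_directions(X, clusters, x, f, alpha):
    """One round of the quoted StEP loop body (illustrative sketch).

    For each cluster c, aggregate d_c = sum over x' in X_c of
    (x' - x) * alpha(||x' - x||) * 1(f(x') = 1), i.e. a weighted sum
    of directions toward positively classified cluster members.

    X: (n, d) data array; clusters: list of index arrays partitioning X;
    x: (d,) point of interest; f: classifier returning 0/1;
    alpha: nonnegative weighting function of distance (assumed form).
    """
    directions = {}
    for c, idx in enumerate(clusters):
        d_c = np.zeros_like(x, dtype=float)
        for xp in X[idx]:
            if f(xp) == 1:  # indicator 1(f(x') = 1)
                d_c += (xp - x) * alpha(np.linalg.norm(xp - x))
        directions[c] = d_c
    return directions
```

In the full algorithm these k directions would be offered to the stakeholder, who returns an updated point of interest, repeating until the point is positively classified.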
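The 70/15/15 split quoted in the Dataset Splits row can be reproduced with a simple index shuffle; a minimal sketch, assuming a fixed seed of 0 for illustration (the paper does not report one):

```python
import numpy as np

# 70/15/15 train/validation/test split over the 30,000 Credit Card
# Default instances described in the report.
rng = np.random.default_rng(0)  # seed is an assumption, not from the paper
n = 30_000
idx = rng.permutation(n)

n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = idx[:n_train]               # ~21k datapoints
val_idx = idx[n_train:n_train + n_val]  # ~4.5k datapoints
test_idx = idx[n_train + n_val:]        # ~4.5k datapoints
```

The resulting set sizes (21,000 / 4,500 / 4,500) match the approximate 21k/4.5k/4.5k counts quoted above.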