Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Simple Steps to Success: A Method for Step-Based Counterfactual Explanations
Authors: Jenny Hamer, Nicholas Perello, Jason Valladares, Vignesh Viswanathan, Yair Zick
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, via a thorough empirical and theoretical investigation, we show that StEP offers provable robustness and privacy guarantees while outperforming popular methods along important metrics. We also provide an extensive experimental evaluation of StEP, including a holistic cross-comparison with three popular recourse methods (DiCE (Mothilal et al., 2020a), FACE (Poyiadzi et al., 2020), and C-CHVAE (Pawelczyk et al., 2020)) on three widely-used financial datasets: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult (Kohavi, 1996). We also investigate StEP's robustness to noise (Section 4.3). ... Section 4: Empirical Evaluation & Analysis |
| Researcher Affiliation | Collaboration | Jenny Hamer (EMAIL), Google DeepMind, New York; Nicholas Perello (EMAIL), University of Massachusetts, Amherst; Jason Valladares (EMAIL), Google; Vignesh Viswanathan (EMAIL), University of Massachusetts, Amherst; Yair Zick (EMAIL), University of Massachusetts, Amherst |
| Pseudocode | Yes | Algorithm 1 Stepwise Explainable Paths (StEP). Require: dataset X partitioned into k clusters {X₁, ..., X_k}, point of interest x, model f, a nonnegatively valued function α : ℝ≥0 → ℝ≥0. 1: while f(x) ≠ 1 do 2: for every cluster c ∈ [k] do 3: Generate a direction d_c ← Σ_{x′ ∈ X_c} (x′ − x) · α(‖x′ − x‖) · 𝟙(f(x′) = 1) 4: end for 5: Offer the directions {d_c}_{c ∈ [k]} to the stakeholder 6: Stakeholder returns an updated point of interest x 7: end while |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the StEP methodology described. It only mentions using and adapting implementations for baseline methods. |
| Open Datasets | Yes | We provide an extensive experimental evaluation of StEP, including a holistic cross-comparison with three popular recourse methods (DiCE (Mothilal et al., 2020a), FACE (Poyiadzi et al., 2020), and C-CHVAE (Pawelczyk et al., 2020)) on three widely-used financial datasets: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult (Kohavi, 1996). ... Datasets and Models: We employ three real-world datasets in our cross-comparison analysis: Credit Card Default (Yeh & Lien, 2009), Give Me Some Credit (Credit Fusion, 2011), and UCI Adult/Census Income (Kohavi, 1996), described in Table 1. |
| Dataset Splits | Yes | For each dataset, we train and validate logistic regression, random forest, and two-layer DNN model instances following a 70/15/15 training, validation, and test (recourse-time) data split. ... Credit Card Default: we produce random train, validation, and test sets from the 30,000 instances using a 70/15/15 split, resulting in sets with approximately 21k/4.5k/4.5k datapoints respectively. |
| Hardware Specification | Yes | We used two machines for our empirical evaluation, including for base model training (i.e. logistic regression, random forest, neural network), for all recourse experiments, and for post-processing of results to produce metrics. Machine 1: 8 CPU cores, 64 GB RAM, NVIDIA 4070 GPU, and 2 TB local SSD disk. Machine 2: 16 CPU cores, 64 GB RAM, NVIDIA 4080S GPU, and 4 TB local SSD disk. |
| Software Dependencies | No | The paper mentions software like 'scikit-learn' and the 'PyTorch library' but does not specify version numbers for these or other key software components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We specify a confidence threshold of 0.7 at test time for each base model to determine whether a PoI is positively classified. Implementation and hyperparameter tuning details are described in Appendix C.2. ... In our experiments, we use the volcano function α_v with d = 2 and γ = 0.5. ... We perform a basic grid search across base models, recourse methods, and datasets using scikit-learn's GridSearchCV function. To determine appropriate hyperparameters to use across the base ML models and recourse methods, we roughly optimize for a reasonably high success rate and to minimize distance. ... We vary k ∈ {1, 2, 3, 4, 5}, the number of paths to produce for each PoI (and for StEP, the number of clusters to produce), and fix k = 3 for our comparative analysis and user-interference experiments. For StEP, we consider step sizes in {0.10, 0.25, 0.50, 0.75, 1.00} and fix a value of 1 across all experiments. ... For C-CHVAE we set the step distance hyperparameter to 1. ... For all applicable recourse methods, we allow a maximum of 50 iterations to produce k counterfactual(s). |
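The direction-generation step quoted in the Pseudocode row (one round of Algorithm 1) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name `step_directions` is ours, and `alpha` is a generic nonnegative distance-weighting stand-in rather than the paper's exact volcano function α_v.

```python
import numpy as np

def step_directions(X, clusters, x, f, alpha):
    """One round of the quoted StEP loop body (illustrative sketch).

    For each cluster c, aggregate d_c = sum over x' in X_c of
    (x' - x) * alpha(||x' - x||) * 1(f(x') = 1), i.e. a weighted sum
    of directions toward positively classified cluster members.

    X: (n, d) data array; clusters: list of index arrays partitioning X;
    x: (d,) point of interest; f: classifier returning 0/1;
    alpha: nonnegative weighting function of distance (assumed form).
    """
    directions = {}
    for c, idx in enumerate(clusters):
        d_c = np.zeros_like(x, dtype=float)
        for xp in X[idx]:
            if f(xp) == 1:  # indicator 1(f(x') = 1)
                d_c += (xp - x) * alpha(np.linalg.norm(xp - x))
        directions[c] = d_c
    return directions
```

In the full algorithm these k directions would be offered to the stakeholder, who returns an updated point of interest, repeating until the point is positively classified.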
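The 70/15/15 split quoted in the Dataset Splits row can be reproduced with a simple index shuffle; a minimal sketch, assuming a fixed seed of 0 for illustration (the paper does not report one):

```python
import numpy as np

# 70/15/15 train/validation/test split over the 30,000 Credit Card
# Default instances described in the report.
rng = np.random.default_rng(0)  # seed is an assumption, not from the paper
n = 30_000
idx = rng.permutation(n)

n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = idx[:n_train]               # ~21k datapoints
val_idx = idx[n_train:n_train + n_val]  # ~4.5k datapoints
test_idx = idx[n_train + n_val:]        # ~4.5k datapoints
```

The resulting set sizes (21,000 / 4,500 / 4,500) match the approximate 21k/4.5k/4.5k counts quoted above.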