Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning Actionable Counterfactual Explanations in Large State Spaces

Authors: Keziah Naggita, Matthew Walter, Avrim Blum

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive empirical evaluations using publicly available healthcare datasets (BRFSS, Foods, and NHANES) and fully-synthetic data. For negatively classified agents identified by linear and threshold-based binary classifiers, we compare the proposed forms of recourse to low-level CFEs, which suggest how the agent can transition from state x to a new state x′ where the model prediction is desirable. We also extensively evaluate the effectiveness of our neural network-based, data-driven CFE generation approaches. Empirical results show that the proposed data-driven CFE generators are accurate and resource-efficient, and the proposed forms of recourse offer various advantages over the low-level CFEs.
Researcher Affiliation Academia Keziah Naggita (EMAIL), Toyota Technological Institute at Chicago; Matthew R. Walter (EMAIL), Toyota Technological Institute at Chicago; Avrim Blum (EMAIL), Toyota Technological Institute at Chicago
Pseudocode Yes Algorithm 1: The agent hl-discrete CFE dataset augmentation
Open Source Code Yes Our code can be accessed here.
Open Datasets Yes We conduct extensive empirical evaluations using publicly available healthcare datasets (BRFSS, Foods, and NHANES) and fully-synthetic data. We extracted the Foods dataset from USDA, Agricultural Research Service, Nutrient Data Laboratory (2016); Awram (2024) and the BMI (body mass index) and WHR (waist-to-hip ratio) datasets from NHANES body measurement surveys (CDC, 1999; ICPSR at the University of Michigan, 2024)... The BRFSS dataset. We extracted the Behavioral Risk Factor Surveillance System (BRFSS) dataset from Teboul (2024); Centers for Disease Control and Prevention (2024).
Dataset Splits Yes We split all datasets into an 80/20 ratio for training and testing. At the end of data preprocessing, for example, removing missing data and ensuring that selected nutritional intake features were a subset of the intersectional ones, the Foods dataset contained 3901 food items. At the end of data preprocessing, we did the 80/20 train/test data split resulting in 40734 data points in the predictive training set and 10184 in the predictive testing set. Lastly, after removing the duplicate health risk agents and splitting the whole dataset 80/20, we had 11039 data points in the predictive training set and 2760 in the predictive testing set.
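The reported split counts (40734/10184 for BRFSS, 11039/2760 for the deduplicated health-risk set) are consistent with a plain shuffled 80/20 partition. A minimal sketch of such a split is below; the function name, seed, and rounding choice are assumptions for illustration, not the authors' actual preprocessing code.

```python
import random

def split_80_20(data, test_frac=0.2, seed=0):
    """Shuffle a dataset and split it into train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = round(len(data) * test_frac)  # size of the 20% test partition
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test

# 50918 total points -> the reported 40734 train / 10184 test counts
train, test = split_80_20(list(range(50918)))
print(len(train), len(test))  # 40734 10184
```

The same arithmetic reproduces the second reported split: 13799 deduplicated points give 11039 train and 2760 test.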
Hardware Specification Yes We conducted all experiments on a laptop with a CPU featuring the following hardware specifications: a 2.6 GHz 6-Core Intel Core i7 processor, 16 GB of 2400 MHz DDR4 RAM, and an Intel UHD Graphics 630 with 1536 MB of video memory.
Software Dependencies No In all cases where we implement Equations 1, 2 and 3, we use the CVXPY Python package (Diamond & Boyd, 2016; Agrawal et al., 2018).
Experiment Setup Yes Specifically, the generator is a neural network model with three hidden layers, each containing 2,000 neurons. The model incorporates ℓ2 regularization, dropout, and batch normalization. The training process uses the Adam optimizer (Kingma & Ba, 2015) with early stopping, restoring the best weights after a patience level of 300. The model trains with a batch size of 6,000 for an average of 5,000 epochs. On average, we used 500 training epochs with a batch size of 128, a dropout rate of 0.4, a learning rate of 0.0005, and either the mean squared error or binary cross-entropy loss as the objective function. Given the agent hl-id CFEs training dataset, we design the data-driven hl-id CFE generator as a neural network model with an average of two hidden layers, each consisting of 2000 neurons, ℓ2 regularization, dropout, and batch normalization. We used the Adam optimizer (Kingma & Ba, 2015) and implemented early stopping and restoration of the best weights after a patience level of 360. On average, we set the batch size to 2000 and the number of epochs set to 3000.
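The generator described above (three hidden layers of 2,000 neurons with dropout and batch normalization) can be sketched framework-free as a single forward pass. This is an illustrative reconstruction only: the input/output dimensions, initialization, and activation are assumptions, and the paper's training-side details (Adam, ℓ2 regularization, early stopping with best-weight restoration) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # He-style initialization, a common default for ReLU layers (assumed here)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

def forward(x, params, train=True, drop_rate=0.4):
    """One pass through the generator: ReLU hidden layers with
    per-batch normalization and (in training mode) dropout."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)               # ReLU hidden layer
        h = (h - h.mean(0)) / (h.std(0) + 1e-5)      # batch normalization
        if train:                                    # dropout, rate 0.4
            mask = rng.random(h.shape) > drop_rate
            h = h * mask / (1.0 - drop_rate)
    W, b = params[-1]
    return h @ W + b                                 # linear output layer

# Three hidden layers of 2000 neurons, per the paper's description;
# the 30-dimensional input/output sizes are made up for illustration.
d_in, d_out = 30, 30
sizes = [d_in, 2000, 2000, 2000, d_out]
params = [dense(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(8, d_in))
print(forward(x, params).shape)  # (8, 30)
```

In a real implementation batch normalization would track running statistics for inference; the per-batch version here only illustrates the layer ordering.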