A General Search-Based Framework for Generating Textual Counterfactual Explanations
Authors: Daniel Gilo, Shaul Markovitch
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show the advantage of our algorithm over state-of-the-art alternatives. Additionally, we report results of a human survey that validates the plausibility of our generated counterfactuals. |
| Researcher Affiliation | Academia | Department of Computer Science, Technion - Israel Institute of Technology, {danielgilo,shaulm}@cs.technion.ac.il |
| Pseudocode | Yes | Algorithm 1: TCE-SEARCH |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the source code for the methodology described in this paper, nor does it provide a direct link to a code repository for their implementation. |
| Open Datasets | Yes | We report results for 8 datasets: (1) Yelp. Yelp business reviews. (2) Amazon (Ni, Li, and McAuley 2019). Video game reviews on Amazon. (3) SST (Socher et al. 2013). Stanford Sentiment Treebank. (4) Science. Comments from Reddit on scientific subjects. (5) Genre. Movie plot descriptions labeled by genre. (6) AGNews (Zhang, Zhao, and LeCun 2015). AG's news articles labeled by topic. (7) Airline. Tweets about flight companies. (8) Spam (Almeida, Hidalgo, and Yamakami 2011). SMS messages labeled either as spam or as legitimate. |
| Dataset Splits | No | The paper states: "For each dataset, we randomly sampled 200 examples to serve as the explanation test set. We randomly split the rest of the dataset, so that 80% serves as the black-box training set and 20% remains for the black-box test set." It explicitly defines training and test sets but does not specify a separate validation split or its percentage. (The quoted split is illustrated in the sketch below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions software components like "DistilBERT", "GPT-2", "GloVe", "scikit-learn", and "AdamW" but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | We used the same fine-tuning procedure for both the LM and the MLM: 3 epochs with initial LR of 5e-05 and weight decay of 0.0 for AdamW. We used a batch size of 2. (A hedged sketch of this configuration appears below the table.) |
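
The "Dataset Splits" and "Experiment Setup" rows quote the paper's numbers directly; the sketch below is one way those numbers could be wired together, not the authors' released code (none is available, per the "Open Source Code" row). The use of Hugging Face `datasets`/`transformers`, the `sst2` dataset name, the `distilbert-base-uncased` checkpoint, the column names, and the random seeds are all assumptions made for illustration. Only the 200-example explanation test set, the 80/20 black-box train/test split, and the 3 epochs / LR 5e-05 / weight decay 0.0 / batch size 2 settings come from the quoted text.

```python
# Minimal sketch of the quoted split and fine-tuning configuration.
# ASSUMPTIONS: dataset name ("sst2"), checkpoint, column names, and seeds
# are placeholders; only the numeric settings are taken from the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("sst2", split="train").shuffle(seed=42)

# 200 randomly sampled examples serve as the explanation test set.
explanation_test = dataset.select(range(200))
rest = dataset.select(range(200, len(dataset)))

# The remainder is split 80/20 into black-box training and test sets.
split = rest.train_test_split(test_size=0.2, seed=42)
bb_train, bb_test = split["train"], split["test"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

bb_train = bb_train.map(tokenize, batched=True)
bb_test = bb_test.map(tokenize, batched=True)

# Hyperparameters quoted in the "Experiment Setup" row: 3 epochs,
# initial LR of 5e-05, weight decay 0.0, batch size 2. AdamW is the
# default optimizer used by Trainer.
args = TrainingArguments(
    output_dir="blackbox-finetune",
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.0,
    per_device_train_batch_size=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=bb_train,
    eval_dataset=bb_test,
)
trainer.train()
```

Per the quote, the same procedure is applied when fine-tuning the LM and the MLM; in a setup like the one above that would presumably mean reusing these `TrainingArguments` with a language-modeling head in place of the classification model.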