A General Search-Based Framework for Generating Textual Counterfactual Explanations
Authors: Daniel Gilo, Shaul Markovitch
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show the advantage of our algorithm over state-of-the-art alternatives. Additionally, we report results of a human survey that validates the plausibility of our generated counterfactuals. |
| Researcher Affiliation | Academia | Department of Computer Science, Technion - Israel Institute of Technology, {danielgilo,shaulm}@cs.technion.ac.il |
| Pseudocode | Yes | Algorithm 1: TCE-SEARCH |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the source code for the methodology described in this paper, nor does it provide a direct link to a code repository for their implementation. |
| Open Datasets | Yes | We report results for 8 datasets: (1) Yelp. Yelp business reviews. (2) Amazon (Ni, Li, and McAuley 2019). Video game reviews on Amazon. (3) SST (Socher et al. 2013). Stanford Sentiment Treebank. (4) Science. Comments from Reddit on scientific subjects. (5) Genre. Movie plot descriptions labeled by genre. (6) AGNews (Zhang, Zhao, and LeCun 2015). AG's news articles labeled by topic. (7) Airline. Tweets about flight companies. (8) Spam (Almeida, Hidalgo, and Yamakami 2011). SMS messages labeled either as spam or as legitimate. |
| Dataset Splits | No | The paper states: "For each dataset, we randomly sampled 200 examples to serve as the explanation test set. We randomly split the rest of the dataset, so that 80% serves as the black-box training set and 20% remains for the black-box test set." It explicitly defines training and test sets but does not specify a separate validation split or its percentage. (The quoted split is illustrated in the sketch below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions software components like "DistilBERT", "GPT-2", "GloVe", "scikit-learn", and "AdamW" but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | We used the same fine-tuning procedure for both the LM and the MLM: 3 epochs with initial LR of 5e-05 and weight decay of 0.0 for AdamW. We used a batch size of 2. (A hedged sketch of this configuration appears below the table.) |
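
The "Dataset Splits" and "Experiment Setup" rows quote the paper's numbers directly; the sketch below is one way those numbers could be wired together, not the authors' released code (none is available, per the "Open Source Code" row). The use of Hugging Face `datasets`/`transformers`, the `sst2` dataset name, the `distilbert-base-uncased` checkpoint, the column names, and the random seeds are all assumptions made for illustration. Only the 200-example explanation test set, the 80/20 black-box train/test split, and the 3 epochs / LR 5e-05 / weight decay 0.0 / batch size 2 settings come from the quoted text.

```python
# Minimal sketch of the quoted split and fine-tuning configuration.
# ASSUMPTIONS: dataset name ("sst2"), checkpoint, column names, and seeds
# are placeholders; only the numeric settings are taken from the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("sst2", split="train").shuffle(seed=42)

# 200 randomly sampled examples serve as the explanation test set.
explanation_test = dataset.select(range(200))
rest = dataset.select(range(200, len(dataset)))

# The remainder is split 80/20 into black-box training and test sets.
split = rest.train_test_split(test_size=0.2, seed=42)
bb_train, bb_test = split["train"], split["test"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

bb_train = bb_train.map(tokenize, batched=True)
bb_test = bb_test.map(tokenize, batched=True)

# Hyperparameters quoted in the "Experiment Setup" row: 3 epochs,
# initial LR of 5e-05, weight decay 0.0, batch size 2. AdamW is the
# default optimizer used by Trainer.
args = TrainingArguments(
    output_dir="blackbox-finetune",
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.0,
    per_device_train_batch_size=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=bb_train,
    eval_dataset=bb_test,
)
trainer.train()
```

Per the quote, the same procedure is applied when fine-tuning the LM and the MLM; in a setup like the one above that would presumably mean reusing these `TrainingArguments` with a language-modeling head in place of the classification model.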