Making a (Counterfactual) Difference One Rationale at a Time
Authors: Mitchell Plyler, Michael Green, Min Chi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of CDA is empirically evaluated by comparing against several baselines including an improved MMI-based rationale schema [19] on two multi-aspect datasets. Our results show that CDA produces rationales that better capture the signal of interest. |
| Researcher Affiliation | Collaboration | Mitchell Plyler, Department of Computer Science, North Carolina State University (mlplyler@ncsu.edu); Michael Green, Laboratory for Analytic Sciences (magree22@ncsu.edu); Min Chi, Department of Computer Science, North Carolina State University (mchi@ncsu.edu) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our software is publicly released: github.com/mlplyler/CFs_for_Rationales |
| Open Datasets | Yes | We conduct experiments using datasets from two sources. The first source contains reviews compiled by Wang et al. [29] from TripAdvisor.com. We use the training, dev, and test sets curated by Bao et al. [2] and used for rationalization by Chang et al. [5]. The second source consists of reviews collected by McAuley et al. [24] from RateBeer. |
| Dataset Splits | Yes | We use the training, dev, and test sets curated by Bao et al. [2] and used for rationalization by Chang et al. [5]. For all of the datasets and models, we use the dev set for early stopping (more details in Appendix Section A.3). |
| Hardware Specification | No | Appendix Section A.4 shows our server configurations and more details on our experiment setup. |
| Software Dependencies | No | All models are in Tensorflow [1]. |
| Experiment Setup | Yes | For the rationale selectors, following [5], we set the rationale percentage to 10% for all datasets. We train the rationale selector and the classifier together, early stop based on the selector cost, freeze the selector, and finally fine-tune the classifier on the original dataset. Additionally, we train two additional rationale models with different random seeds and the selected hyperparameters. The parameters and checkpoints for the CF Predictor models are tuned and chosen to maximize the accuracy of the training documents' predicted labels as compared to the target labels (measured by the original rationale model) and to maximize the entropy of the inserted counterfactual tokens. The CF Predictor model is chosen via a grid search over λA and λRL, using only the training dataset. |
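The staged procedure quoted in the Experiment Setup row (joint training with early stopping on the selector cost, then freezing the selector and fine-tuning the classifier) can be sketched as below. This is a hypothetical illustration only, not the authors' released code: the dev-set cost is a dummy stand-in, and all function names are assumptions.

```python
# Hypothetical sketch of the staged training described in the paper's setup.
# The dev-set selector cost here is a dummy monotone-decreasing sequence,
# standing in for a real evaluation on the dev split.

def train_jointly(steps, patience=3):
    """Stage 1: train selector + classifier together; early stop on selector cost."""
    best_cost, wait = float("inf"), 0
    for step in range(steps):
        dev_selector_cost = 1.0 / (step + 1)  # dummy dev-set selector cost
        if dev_selector_cost < best_cost:
            best_cost, wait = dev_selector_cost, 0
        else:
            wait += 1
            if wait >= patience:
                break  # early stopping triggered on the selector cost
    return best_cost


def finetune_classifier(frozen_selector_cost, epochs=2):
    """Stage 2: selector is frozen, so its cost no longer changes;
    only the classifier would be updated here."""
    return frozen_selector_cost


# Run the two stages in sequence.
selector_cost = train_jointly(steps=10)
final_cost = finetune_classifier(selector_cost)
```

The point of the sketch is the ordering: the early-stopping criterion is tied to the selector's cost during joint training, and once the selector is frozen only the classifier is updated.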