Making a (Counterfactual) Difference One Rationale at a Time
Authors: Mitchell Plyler, Michael Green, Min Chi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of CDA is empirically evaluated by comparing against several baselines including an improved MMI-based rationale schema [19] on two multi-aspect datasets. Our results show that CDA produces rationales that better capture the signal of interest. |
| Researcher Affiliation | Collaboration | Mitchell Plyler, Department of Computer Science, North Carolina State University (mlplyler@ncsu.edu); Michael Green, Laboratory for Analytic Sciences (magree22@ncsu.edu); Min Chi, Department of Computer Science, North Carolina State University (mchi@ncsu.edu) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our software is publicly released: github.com/mlplyler/CFs_for_Rationales |
| Open Datasets | Yes | We conduct experiments using datasets from two sources. The first source contains reviews compiled by Wang et al. [29] from TripAdvisor.com. We use the training, dev, and test sets curated by Bao et al. [2] and used for rationalization by Chang et al. [5]. The second source consists of reviews collected by McAuley et al. [24] from RateBeer. |
| Dataset Splits | Yes | We use the training, dev, and test sets curated by Bao et al. [2] and used for rationalization by Chang et al. [5]. For all of the datasets and models, we use the dev set for early stopping (more details in Appendix Section A.3). |
| Hardware Specification | No | Appendix Section A.4 shows our server configurations and more details on our experiment setup. |
| Software Dependencies | No | All models are in Tensorflow [1]. |
| Experiment Setup | Yes | For the rationale selectors, following [5], we set the rationale percentage to 10% for all datasets. We train the rationale selector and the classifier together, early stop based on the selector cost, freeze the selector, and finally fine-tune the classifier on the original dataset. Additionally, we train two additional rationale models with different random seeds and the selected hyperparameters. The parameters and checkpoints for the CF Predictor models are tuned and chosen to maximize the accuracy of the training documents' predicted labels as compared to the target labels (measured by the original rationale model) and to maximize the entropy of the inserted counterfactual tokens. The CF Predictor model is chosen via a grid search over λA and λRL, using only the training dataset. |
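The staged procedure quoted in the Experiment Setup row (joint training with early stopping on the selector cost, then freezing the selector and fine-tuning the classifier) can be sketched as below. This is a hypothetical illustration only, not the authors' released code: the dev-set cost is a dummy stand-in, and all function names are assumptions.

```python
# Hypothetical sketch of the staged training described in the paper's setup.
# The dev-set selector cost here is a dummy monotone-decreasing sequence,
# standing in for a real evaluation on the dev split.

def train_jointly(steps, patience=3):
    """Stage 1: train selector + classifier together; early stop on selector cost."""
    best_cost, wait = float("inf"), 0
    for step in range(steps):
        dev_selector_cost = 1.0 / (step + 1)  # dummy dev-set selector cost
        if dev_selector_cost < best_cost:
            best_cost, wait = dev_selector_cost, 0
        else:
            wait += 1
            if wait >= patience:
                break  # early stopping triggered on the selector cost
    return best_cost


def finetune_classifier(frozen_selector_cost, epochs=2):
    """Stage 2: selector is frozen, so its cost no longer changes;
    only the classifier would be updated here."""
    return frozen_selector_cost


# Run the two stages in sequence.
selector_cost = train_jointly(steps=10)
final_cost = finetune_classifier(selector_cost)
```

The point of the sketch is the ordering: the early-stopping criterion is tied to the selector's cost during joint training, and once the selector is frozen only the classifier is updated.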