Explaining the Efficacy of Counterfactually Augmented Data
Authors: Divyansh Kaushik, Amrith Setlur, Eduard H Hovy, Zachary Chase Lipton
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thus, we present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD. Through a series of large-scale empirical experiments addressing sentiment analysis and natural language inference (NLI) tasks, we inject noise on the spans marked as causal vs. non-causal. (A hypothetical sketch of this span-level noising step appears below the table.) |
| Researcher Affiliation | Academia | Divyansh Kaushik, Amrith Setlur, Eduard Hovy, Zachary C. Lipton Carnegie Mellon University Pittsburgh, PA, USA {dkaushik, asetlur, hovy, zlipton}@cmu.edu |
| Pseudocode | No | No pseudocode or algorithm blocks were found within the paper's content. |
| Open Source Code | No | No explicit statement or specific link to the source code implementing the paper's core methodology or analysis was provided. The paper primarily uses and adapts existing open-source tools and datasets. |
| Open Datasets | Yes | We conduct experiments on sentiment analysis (Zaidan et al., 2007; Kaushik et al., 2020) and NLI (DeYoung et al., 2020). All datasets can be found at https://github.com/acmi-lab/counterfactually-augmented-data |
| Dataset Splits | No | While the paper mentions using a 'validation set' for hyperparameter tuning and early stopping ('We identify parameters for both classifiers using grid search conducted over the validation set.' and 'We apply early stopping when validation loss does not decrease for 5 epochs.'), it does not explicitly specify the exact percentages or absolute counts for the training, validation, and test splits for all datasets used in the experiments. |
| Hardware Specification | Yes | We fine-tune BERT up to 20 epochs... with a batch size of 16 (to fit on a 16GB Tesla V-100 GPU). We fine-tune Longformer for 10 epochs... using a batch size of 8 (to fit on 64GB of GPU memory). |
| Software Dependencies | No | The paper mentions using 'scikit-learn' and 'off-the-shelf uncased BERT Base and Longformer Base models (Wolf et al., 2019),' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015), with a learning rate of 1e-4 and a batch size of 16. We apply early stopping when validation loss does not decrease for 5 epochs. We fine-tune BERT up to 20 epochs with same early stopping criteria as for BiLSTM, using the BERT Adam optimizer with a batch size of 16... We found learning rates of 5e-5 and 1e-5 to work best for sentiment analysis and NLI respectively. We fine-tune Longformer for 10 epochs with early stopping, using a batch size of 8... |
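
The Experiment Setup row quotes the paper's fine-tuning recipe (Adam, learning rates of 1e-4 for the BiLSTM and 5e-5 or 1e-5 for BERT, batch size 16, early stopping after 5 epochs without validation-loss improvement). The sketch below is a minimal rendering of that recipe, not the authors' released code; `model`, `train_loader`, and `val_loader` are assumed placeholders, and the model is assumed to return a Hugging Face-style output whose `.loss` field is populated when labels are included in the batch.

```python
# Minimal sketch of the quoted fine-tuning recipe, not the authors' implementation.
# Assumptions: `train_loader`/`val_loader` are PyTorch DataLoaders yielding dicts of
# tensors (including labels), and `model(**batch)` returns an output with a `.loss`
# field (as Hugging Face sequence-classification models do).
import torch

def fine_tune(model, train_loader, val_loader, lr=5e-5, max_epochs=20, patience=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam, per the quoted setup
    best_val_loss, epochs_without_improvement = float("inf"), 0
    for _ in range(max_epochs):                              # at most 20 epochs
        model.train()
        for batch in train_loader:                           # batch size 16 set in the loader
            optimizer.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()

        # Early stopping: halt when validation loss has not improved for `patience` epochs.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                val_loss += model(**batch).loss.item()
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return model
```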
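
The Research Type row mentions injecting noise on spans marked as causal vs. non-causal. The table does not spell out the paper's exact noising scheme, so the following is only a hypothetical illustration of one plausible operation (random token substitution restricted to a marked span); `noise_span`, the example sentence, and the toy vocabulary are invented for illustration.

```python
# Hypothetical illustration only: one plausible way to inject noise into a marked span
# (random token substitution), leaving the rest of the sentence untouched. The paper's
# actual noising procedure may differ; names and data here are invented for the example.
import random

def noise_span(tokens, span_indices, vocab, replace_prob=0.3, seed=None):
    """Return a copy of `tokens` with tokens at `span_indices` randomly substituted."""
    rng = random.Random(seed)
    noised = list(tokens)
    for i in span_indices:
        if rng.random() < replace_prob:
            noised[i] = rng.choice(vocab)  # substitution happens only inside the span
    return noised

# Toy comparison: noise the "causal" span vs. its "non-causal" complement.
tokens = "the movie was surprisingly good despite the slow start".split()
causal = [3, 4]                                        # e.g. the edited span "surprisingly good"
non_causal = [i for i in range(len(tokens)) if i not in causal]
toy_vocab = ["great", "terrible", "fine", "boring", "film", "plot"]
print(noise_span(tokens, causal, toy_vocab, seed=0))
print(noise_span(tokens, non_causal, toy_vocab, seed=0))
```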