Explaining the Efficacy of Counterfactually Augmented Data

Authors: Divyansh Kaushik, Amrith Setlur, Eduard H. Hovy, Zachary Chase Lipton

ICLR 2021

Reproducibility assessment (each entry lists the reproducibility variable, the result, and the LLM response):
Research Type: Experimental. "Thus, we present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD." "Through a series of large-scale empirical experiments addressing sentiment analysis and natural language inference (NLI) tasks, we inject noise on the spans marked as causal vs. non-causal." (See the span-noising sketch below the list.)
Researcher Affiliation: Academia. Divyansh Kaushik, Amrith Setlur, Eduard Hovy, Zachary C. Lipton; Carnegie Mellon University, Pittsburgh, PA, USA; {dkaushik, asetlur, hovy, zlipton}@cmu.edu
Pseudocode: No. No pseudocode or algorithm blocks were found in the paper.
Open Source Code: No. No explicit statement or link to source code implementing the paper's core methodology or analysis is provided; the paper primarily uses and adapts existing open-source tools and datasets.
Open Datasets: Yes. "We conduct experiments on sentiment analysis (Zaidan et al., 2007; Kaushik et al., 2020) and NLI (DeYoung et al., 2020)." All datasets are available at https://github.com/acmi-lab/counterfactually-augmented-data (see the data-loading sketch below the list).
Dataset Splits: No. The paper mentions a validation set for hyperparameter tuning and early stopping ("We identify parameters for both classifiers using grid search conducted over the validation set."; "We apply early stopping when validation loss does not decrease for 5 epochs."), but it does not give the exact percentages or absolute counts of the training, validation, and test splits for all datasets used in the experiments.
Hardware Specification: Yes. "We fine-tune BERT up to 20 epochs... with a batch size of 16 (to fit on a 16GB Tesla V-100 GPU). We fine-tune Longformer for 10 epochs... using a batch size of 8 (to fit on 64GB of GPU memory)."
Software Dependencies: No. The paper mentions using scikit-learn and "off-the-shelf uncased BERT Base and Longformer Base models (Wolf et al., 2019)", but does not provide version numbers for these or other software dependencies. (See the model-loading sketch below the list.)
Experiment Setup: Yes. "We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015), with a learning rate of 1e-4 and a batch size of 16. We apply early stopping when validation loss does not decrease for 5 epochs. We fine-tune BERT up to 20 epochs with the same early stopping criteria as for the BiLSTM, using the BERT Adam optimizer with a batch size of 16... We found learning rates of 5e-5 and 1e-5 to work best for sentiment analysis and NLI respectively. We fine-tune Longformer for 10 epochs with early stopping, using a batch size of 8..." (See the configuration sketch below the list.)
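
To make the "inject noise on the spans marked as causal vs. non-causal" step concrete, here is a minimal sketch of span-level noising, assuming noise takes the form of replacing a fraction of in-span tokens with random vocabulary tokens. The function name, noise rate, and replacement pool are illustrative assumptions, not the authors' exact procedure.

```python
import random

def inject_span_noise(tokens, span_mask, noise_rate=0.3, vocab=None, seed=0):
    """Replace a fraction of tokens inside the marked spans, leaving the rest untouched.

    tokens     -- list of token strings for one example
    span_mask  -- list of 0/1 flags, 1 where the token lies in a marked span
                  (causal or non-causal, depending on the condition being tested)
    noise_rate -- fraction of in-span tokens to corrupt (assumed value; the study
                  varies noise levels rather than fixing one)
    vocab      -- token pool to sample replacements from
    """
    rng = random.Random(seed)
    vocab = vocab or tokens  # fallback pool; a real run would use the full vocabulary
    noised = []
    for tok, in_span in zip(tokens, span_mask):
        if in_span and rng.random() < noise_rate:
            noised.append(rng.choice(vocab))
        else:
            noised.append(tok)
    return noised

# Example: corrupt only the tokens a human marked as causal ("loved", "great").
tokens = "I loved this movie , the acting was great".split()
causal = [0, 1, 0, 0, 0, 0, 0, 0, 1]
print(inject_span_noise(tokens, causal, noise_rate=1.0))
```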
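
A minimal sketch of loading the counterfactually augmented sentiment data from the repository linked in the Open Datasets entry. The directory layout and column names (sentiment/combined/train.tsv, Sentiment, Text) are assumptions about the repository, not details stated in the paper.

```python
import pandas as pd

# Assumed path inside a local clone of
# https://github.com/acmi-lab/counterfactually-augmented-data;
# the folder and column names below are assumptions, not taken from the paper.
train = pd.read_csv(
    "counterfactually-augmented-data/sentiment/combined/train.tsv", sep="\t"
)

print(train.columns.tolist())            # e.g. ["Sentiment", "Text"] if the assumed layout holds
print(train["Sentiment"].value_counts()) # label balance, assuming a "Sentiment" column
```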
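
The off-the-shelf models named in the Software Dependencies entry can be obtained through the Hugging Face transformers library (Wolf et al., 2019). The snippet below is a generic loading sketch; the checkpoint names are the standard public identifiers, not names quoted from the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Uncased BERT Base for binary sentiment classification.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Longformer Base, useful for the longer movie-review inputs.
long_tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
longformer = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

inputs = bert_tok("I loved this movie", return_tensors="pt")
logits = bert(**inputs).logits  # shape: (1, 2)
```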
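
The hyperparameters quoted in the Experiment Setup entry can be collected into a small configuration object with an early-stopping check on validation loss. This is a sketch: the class and function names are illustrative, the mapping of the first quoted sentence to the BiLSTM is inferred from the quote, and only the numbers quoted above (epochs, batch sizes, learning rates, patience of 5) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model: str
    max_epochs: int
    batch_size: int
    learning_rate: float
    patience: int = 5  # stop when validation loss has not improved for 5 epochs

# Numbers quoted in the Experiment Setup entry; the Longformer learning rate is
# not quoted there, so the value used here is an assumption.
CONFIGS = {
    "bilstm": TrainConfig("bilstm", max_epochs=20, batch_size=16, learning_rate=1e-4),
    "bert_sentiment": TrainConfig("bert-base-uncased", max_epochs=20, batch_size=16, learning_rate=5e-5),
    "bert_nli": TrainConfig("bert-base-uncased", max_epochs=20, batch_size=16, learning_rate=1e-5),
    "longformer": TrainConfig("allenai/longformer-base-4096", max_epochs=10, batch_size=8, learning_rate=3e-5),
}

def should_stop(val_losses, patience=5):
    """Early stopping: True once the best validation loss is at least `patience` epochs old."""
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

print(should_stop([0.70, 0.55, 0.54, 0.56, 0.57, 0.58, 0.59, 0.60]))  # -> True
```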