Explaining the Efficacy of Counterfactually Augmented Data

Authors: Divyansh Kaushik, Amrith Setlur, Eduard H. Hovy, Zachary Chase Lipton

ICLR 2021

Reproducibility assessment (each entry lists the reproducibility variable, the result, and the LLM response):
Research Type: Experimental. "Thus, we present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD." "Through a series of large-scale empirical experiments addressing sentiment analysis and natural language inference (NLI) tasks, we inject noise on the spans marked as causal vs. non-causal." (See the span-noising sketch below the list.)
Researcher Affiliation: Academia. Divyansh Kaushik, Amrith Setlur, Eduard Hovy, Zachary C. Lipton; Carnegie Mellon University, Pittsburgh, PA, USA; {dkaushik, asetlur, hovy, zlipton}@cmu.edu
Pseudocode: No. No pseudocode or algorithm blocks were found in the paper.
Open Source Code: No. No explicit statement or link to source code implementing the paper's core methodology or analysis is provided; the paper primarily uses and adapts existing open-source tools and datasets.
Open Datasets: Yes. "We conduct experiments on sentiment analysis (Zaidan et al., 2007; Kaushik et al., 2020) and NLI (DeYoung et al., 2020)." All datasets are available at https://github.com/acmi-lab/counterfactually-augmented-data (see the data-loading sketch below the list).
Dataset Splits: No. The paper mentions a validation set for hyperparameter tuning and early stopping ("We identify parameters for both classifiers using grid search conducted over the validation set."; "We apply early stopping when validation loss does not decrease for 5 epochs."), but it does not give the exact percentages or absolute counts of the training, validation, and test splits for all datasets used in the experiments.
Hardware Specification: Yes. "We fine-tune BERT up to 20 epochs... with a batch size of 16 (to fit on a 16GB Tesla V-100 GPU). We fine-tune Longformer for 10 epochs... using a batch size of 8 (to fit on 64GB of GPU memory)."
Software Dependencies: No. The paper mentions using scikit-learn and "off-the-shelf uncased BERT Base and Longformer Base models (Wolf et al., 2019)", but does not provide version numbers for these or other software dependencies. (See the model-loading sketch below the list.)
Experiment Setup: Yes. "We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015), with a learning rate of 1e-4 and a batch size of 16. We apply early stopping when validation loss does not decrease for 5 epochs. We fine-tune BERT up to 20 epochs with the same early stopping criteria as for the BiLSTM, using the BERT Adam optimizer with a batch size of 16... We found learning rates of 5e-5 and 1e-5 to work best for sentiment analysis and NLI respectively. We fine-tune Longformer for 10 epochs with early stopping, using a batch size of 8..." (See the configuration sketch below the list.)
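
To make the "inject noise on the spans marked as causal vs. non-causal" step concrete, here is a minimal sketch of span-level noising, assuming noise takes the form of replacing a fraction of in-span tokens with random vocabulary tokens. The function name, noise rate, and replacement pool are illustrative assumptions, not the authors' exact procedure.

```python
import random

def inject_span_noise(tokens, span_mask, noise_rate=0.3, vocab=None, seed=0):
    """Replace a fraction of tokens inside the marked spans, leaving the rest untouched.

    tokens     -- list of token strings for one example
    span_mask  -- list of 0/1 flags, 1 where the token lies in a marked span
                  (causal or non-causal, depending on the condition being tested)
    noise_rate -- fraction of in-span tokens to corrupt (assumed value; the study
                  varies noise levels rather than fixing one)
    vocab      -- token pool to sample replacements from
    """
    rng = random.Random(seed)
    vocab = vocab or tokens  # fallback pool; a real run would use the full vocabulary
    noised = []
    for tok, in_span in zip(tokens, span_mask):
        if in_span and rng.random() < noise_rate:
            noised.append(rng.choice(vocab))
        else:
            noised.append(tok)
    return noised

# Example: corrupt only the tokens a human marked as causal ("loved", "great").
tokens = "I loved this movie , the acting was great".split()
causal = [0, 1, 0, 0, 0, 0, 0, 0, 1]
print(inject_span_noise(tokens, causal, noise_rate=1.0))
```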
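
A minimal sketch of loading the counterfactually augmented sentiment data from the repository linked in the Open Datasets entry. The directory layout and column names (sentiment/combined/train.tsv, Sentiment, Text) are assumptions about the repository, not details stated in the paper.

```python
import pandas as pd

# Assumed path inside a local clone of
# https://github.com/acmi-lab/counterfactually-augmented-data;
# the folder and column names below are assumptions, not taken from the paper.
train = pd.read_csv(
    "counterfactually-augmented-data/sentiment/combined/train.tsv", sep="\t"
)

print(train.columns.tolist())            # e.g. ["Sentiment", "Text"] if the assumed layout holds
print(train["Sentiment"].value_counts()) # label balance, assuming a "Sentiment" column
```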
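
The off-the-shelf models named in the Software Dependencies entry can be obtained through the Hugging Face transformers library (Wolf et al., 2019). The snippet below is a generic loading sketch; the checkpoint names are the standard public identifiers, not names quoted from the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Uncased BERT Base for binary sentiment classification.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Longformer Base, useful for the longer movie-review inputs.
long_tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
longformer = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

inputs = bert_tok("I loved this movie", return_tensors="pt")
logits = bert(**inputs).logits  # shape: (1, 2)
```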
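
The hyperparameters quoted in the Experiment Setup entry can be collected into a small configuration object with an early-stopping check on validation loss. This is a sketch: the class and function names are illustrative, the mapping of the first quoted sentence to the BiLSTM is inferred from the quote, and only the numbers quoted above (epochs, batch sizes, learning rates, patience of 5) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model: str
    max_epochs: int
    batch_size: int
    learning_rate: float
    patience: int = 5  # stop when validation loss has not improved for 5 epochs

# Numbers quoted in the Experiment Setup entry; the Longformer learning rate is
# not quoted there, so the value used here is an assumption.
CONFIGS = {
    "bilstm": TrainConfig("bilstm", max_epochs=20, batch_size=16, learning_rate=1e-4),
    "bert_sentiment": TrainConfig("bert-base-uncased", max_epochs=20, batch_size=16, learning_rate=5e-5),
    "bert_nli": TrainConfig("bert-base-uncased", max_epochs=20, batch_size=16, learning_rate=1e-5),
    "longformer": TrainConfig("allenai/longformer-base-4096", max_epochs=10, batch_size=8, learning_rate=3e-5),
}

def should_stop(val_losses, patience=5):
    """Early stopping: True once the best validation loss is at least `patience` epochs old."""
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

print(should_stop([0.70, 0.55, 0.54, 0.56, 0.57, 0.58, 0.59, 0.60]))  # -> True
```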