Learning The Difference That Makes A Difference With Counterfactually-Augmented Data

Authors: Divyansh Kaushik, Eduard Hovy, Zachary Lipton

ICLR 2020

Reproducibility assessment: each variable below is listed with its result, followed by the LLM response supporting it.
Research Type: Experimental. In this exploratory paper, we design a human-in-the-loop system for counterfactually manipulating documents. Our hope is that by intervening only upon the factor of interest, we might disentangle the spurious and non-spurious associations, yielding classifiers that hold up better when spurious associations do not transport out of domain. We employ crowd workers not to label documents, but rather to edit them, manipulating the text to make a targeted (counterfactual) class applicable. For sentiment analysis, we direct the worker to "revise this negative movie review to make it positive, without making any gratuitous changes." We might regard the second part of this directive as a least action principle, ensuring that we perturb only those spans necessary to alter the applicability of the label. For NLI, a 3-class classification task (entailment, contradiction, neutral), we ask the workers to modify the premise while keeping the hypothesis intact, and vice versa, collecting edits corresponding to each of the (two) counterfactual classes. Using this platform, we collect thousands of counterfactually-manipulated examples for both sentiment analysis and NLI, extending the IMDb (Maas et al., 2011) and SNLI (Bowman et al., 2015) datasets, respectively. The result is two new datasets (each an extension of a standard resource) that enable us to both probe fundamental properties of language and train classifiers less reliant on spurious signal. We show that classifiers trained on original IMDb reviews fail on counterfactually-revised data and vice versa. We further show that spurious correlations in these datasets are picked up even by linear models. However, augmenting the original data with the revised examples breaks up these correlations (e.g., genre ceases to be predictive of sentiment). For a Bidirectional LSTM (Graves & Schmidhuber, 2005) trained on IMDb reviews, classification accuracy drops from 79.3% to 55.7% when evaluation moves from original to revised reviews. The same classifier trained on revised reviews achieves an accuracy of 89.1% on revised reviews compared to 62.5% on their original counterparts. These numbers rise to 81.7% and 92.0% on original and revised data, respectively, when the classifier is retrained on the combined dataset. Similar patterns are observed for linear classifiers.
Researcher Affiliation: Academia. Divyansh Kaushik, Eduard Hovy, Zachary C. Lipton; Carnegie Mellon University, Pittsburgh, PA, USA; {dkaushik, hovy, zlipton}@cmu.edu
Pseudocode: No. The paper describes its data collection and model architectures in text and provides figures of the annotation platform and pipeline, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. Both datasets are publicly available at https://github.com/dkaushik96/counterfactually-augmented-data; however, this link provides the datasets collected in the paper, not the source code for the methodology or models used.
Open Datasets: Yes. Using this platform, we collect thousands of counterfactually-manipulated examples for both sentiment analysis and NLI, extending the IMDb (Maas et al., 2011) and SNLI (Bowman et al., 2015) datasets, respectively. The result is two new datasets (each an extension of a standard resource) that enable us to both probe fundamental properties of language and train classifiers less reliant on spurious signal. Both datasets are publicly available at https://github.com/dkaushik96/counterfactually-augmented-data. The original IMDb dataset consists of 50k reviews divided equally across train and test splits. To keep the task of editing from growing unwieldy, we filter out the longest 20% of reviews, leaving 20k reviews in the train split, from which we randomly sample 2.5k reviews, enforcing a 50:50 class balance. Following revision by the crowd workers, we partition this dataset into train/validation/test splits containing 1707, 245, and 488 examples, respectively. We randomly sampled 1750, 250, and 500 pairs from the train, validation, and test sets of SNLI, respectively, constraining the new data to have balanced classes. RP (revised premise) and RH (revised hypothesis) each comprise 3332 pairs in train, 400 in validation, and 800 in test, leading to a total of 6664 pairs in train, 800 in validation, and 1600 in test in the revised dataset.
Dataset Splits: Yes. For IMDb: Following revision by the crowd workers, we partition this dataset into train/validation/test splits containing 1707, 245, and 488 examples, respectively. For SNLI: RP and RH each comprise 3332 pairs in train, 400 in validation, and 800 in test, leading to a total of 6664 pairs in train, 800 in validation, and 1600 in test in the revised dataset.
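The IMDb subsampling described under Open Datasets (drop the longest 20% of reviews, then draw a 2.5k class-balanced sample) is simple to reproduce. Below is a minimal Python sketch of that procedure; it assumes the reviews are available as (text, label) pairs and is an illustration rather than the authors' code, which is not released.

```python
import random

def sample_balanced_subset(reviews, k=2500, drop_longest_frac=0.2, seed=0):
    """Illustrative sketch of the IMDb subsampling described above.

    `reviews` is assumed to be a list of (text, label) pairs with labels in
    {"pos", "neg"}; this data layout is an assumption, not the paper's format.
    """
    rng = random.Random(seed)
    # Keep the shortest 80% of reviews (by token count) so editing stays tractable.
    reviews = sorted(reviews, key=lambda r: len(r[0].split()))
    keep = reviews[: int(len(reviews) * (1 - drop_longest_frac))]
    # Enforce a 50:50 class balance in the sampled subset.
    pos = [r for r in keep if r[1] == "pos"]
    neg = [r for r in keep if r[1] == "neg"]
    return rng.sample(pos, k // 2) + rng.sample(neg, k // 2)
```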
Hardware Specification: Yes. We fine-tune BERT up to 20 epochs with the same early stopping criteria as for the Bi-LSTM, using the BERT Adam optimizer with a batch size of 16 (to fit on a Tesla V-100 GPU).
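For the BERT fine-tuning referenced here (batch size 16 to fit on a Tesla V-100; sequence length and learning rate are quoted under Experiment Setup below), the following is a minimal sketch of a single training step. It uses the current Hugging Face transformers API and torch.optim.AdamW in place of the pytorch-transformers package and BERT Adam optimizer named in the quotes; everything beyond the quoted hyperparameters is an illustrative assumption.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2).to(device)

# AdamW stands in for the BERT Adam optimizer named in the quote; 5e-5 is the
# sentiment-analysis learning rate reported under Experiment Setup below.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def fine_tune_step(texts, labels):
    """One training step on a batch of 16 reviews, truncated to 350 sub-word tokens."""
    batch = tokenizer(texts, truncation=True, max_length=350,
                      padding=True, return_tensors="pt").to(device)
    loss = model(**batch, labels=torch.tensor(labels, device=device)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```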
Software Dependencies: No. We use scikit-learn (Pedregosa et al., 2011) implementations of SVMs and Naïve Bayes for sentiment analysis. We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015). We use an off-the-shelf uncased BERT Base model, fine-tuning for each task (https://github.com/huggingface/pytorch-transformers). While several software tools are mentioned, specific version numbers (e.g., for scikit-learn or pytorch-transformers) are not provided.
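As a concrete illustration of the scikit-learn tooling named above, here is a minimal linear sentiment baseline. The TF-IDF featurization and the specific estimator classes (LinearSVC, MultinomialNB) are assumptions made for illustration; the quoted text only states that scikit-learn SVM and Naïve Bayes implementations were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Linear baselines of the kind named above; TF-IDF features are an assumption.
svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
nb_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Toy stand-ins for the original or counterfactually revised reviews.
train_texts = ["a wonderful, moving film", "a dull, lifeless mess"]
train_labels = ["positive", "negative"]

svm_clf.fit(train_texts, train_labels)
print(svm_clf.predict(["a moving and wonderful story"]))  # expected: ['positive']
```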
Experiment Setup: Yes. Bi-LSTM: When training Bi-LSTMs for sentiment analysis, we restrict the vocabulary to the most frequent 20k tokens, replacing out-of-vocabulary tokens with UNK. We fix the maximum input length at 300 tokens and pad shorter reviews. Each token is represented by a randomly-initialized 50-dimensional embedding. Our model consists of a bidirectional LSTM (hidden dimension 50) with recurrent dropout (probability 0.5) and global max-pooling following the embedding layer. To generate output, we feed this (fixed-length) representation through a fully-connected hidden layer with ReLU (Nair & Hinton, 2010) activation (hidden dimension 50), and then a fully-connected output layer with softmax activation. We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015), with a learning rate of 1e-3 and a batch size of 32. We apply early stopping when validation loss does not decrease for 5 epochs.
ELMo-LSTM: The module outputs a 1024-dimensional weighted sum of representations from the 3 Bi-LSTM layers used in ELMo. We represent each word by a 128-dimensional embedding concatenated to the resulting 1024-dimensional ELMo representation, leading to a 1152-dimensional hidden representation. Following Batch Normalization, this is passed through an LSTM (hidden size 128) with recurrent dropout (probability 0.2). The output from this LSTM is then passed to a fully-connected output layer with softmax activation. We train this model for up to 20 epochs with the same early stopping criteria as for the Bi-LSTM, using the Adam optimizer with a learning rate of 1e-3 and a batch size of 32.
BERT: We use an off-the-shelf uncased BERT Base model, fine-tuning for each task. To account for BERT's sub-word tokenization, we set the maximum token length at 350 for sentiment analysis and 50 for NLI. We fine-tune BERT for up to 20 epochs with the same early stopping criteria as for the Bi-LSTM, using the BERT Adam optimizer with a batch size of 16 (to fit on a Tesla V-100 GPU). We found learning rates of 5e-5 and 1e-5 to work best for sentiment analysis and NLI, respectively.
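The Bi-LSTM description above maps naturally onto a Keras-style model. The sketch below follows the quoted hyperparameters (20k vocabulary, 300-token inputs, 50-dimensional embeddings, hidden size 50, recurrent dropout 0.5, ReLU hidden layer, Adam at 1e-3, early stopping with patience 5); the Keras framing itself is an assumption, since the paper does not release model code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000  # most frequent 20k tokens; everything else maps to UNK
MAX_LEN = 300        # reviews truncated/padded to 300 tokens
EMBED_DIM = 50       # randomly initialized embeddings

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(50, recurrent_dropout=0.5, return_sequences=True)),
    layers.GlobalMaxPooling1D(),
    layers.Dense(50, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop training when validation loss fails to improve for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, batch_size=32, callbacks=[early_stop])
```

The commented-out fit call uses placeholder arrays (x_train, y_train, x_val, y_val) standing in for the tokenized original, revised, or combined review sets compared in the paper.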