reproducibilityindex.ai

Adversarial Filters of Dataset Biases

Authors: Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, Yejin Choi

ICML 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present extensive supporting evidence that AFLITE is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks. We present experiments under a synthetic setting, to evaluate whether AFLITE successfully removes examples with spurious correlations from a dataset. As our first real-world data evaluation for AFLITE, we consider out-of-domain and in-domain generalization for a variety of language datasets. We evaluate AFLITE on image classification through ImageNet (ILSVRC2012) classification.
Researcher Affiliation	Collaboration	(1) Allen Institute for Artificial Intelligence; (2) Paul G. Allen School of Computer Science, University of Washington.
Pseudocode	Yes	Algorithm 1 AFLITE. Input: dataset D = (X, Y), pre-computed representation Φ(X), model family M, target dataset size n, number of random partitions m, training set size t < n, slice size k ≤ n, early-stopping threshold τ. Output: reduced dataset S. S = D; while |S| > n do
Open Source Code	Yes	Code & data at https://github.com/allenai/aflite-public. All datasets and code for this work are publicly available.
Open Datasets	Yes	natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016). MultiNLI (Williams et al., 2018), and the QNLI dataset (Wang et al., 2018a). ImageNet (ILSVRC2012) classification.
Dataset Splits	Yes	Table 3 shows the results for SNLI. In all cases, applying AFLITE substantially reduces overall model accuracy, with typical drops of 15-35% depending on the models used for learning the feature representations and those used for evaluation of the filtered dataset. Training set size 550k 92k 138k 109k 92k -458k. We evaluate AFLITE on image classification through ImageNet (ILSVRC2012) classification. For evaluation, the ImageNet-AFLITE filtered validation set is much harder than the standard validation set (also see Figure 1).
Hardware Specification	No	Computations on beaker.org were supported in part by credits from Google Cloud.
Software Dependencies	No	No specific software versions (e.g., Python 3.8, PyTorch 1.9) are provided in the paper. Mentions 'scikit-learn' without a version.
Experiment Setup	Yes	Algorithm 1 provides an implementation of AFLITE. The algorithm takes as input a dataset D = (X, Y), a representation Φ(X) we are interested in minimizing the bias in, a model family M (e.g., linear classifiers), a target dataset size n, size m of the support of the expectation in Eq. (4), training set size t for the classifiers, size k of each slice, and an early-stopping filtering threshold τ. Appendix A.5 provides details of hyperparameters used across different experimental settings, to be discussed in the following sections.
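
The Pseudocode excerpt above stops at the start of the while loop, so the following is only a minimal sketch of how the listed inputs could fit together. It is not the authors' implementation (that is in the linked repository). The loop body is an assumed reading: repeatedly split the current set at random, train on t instances, score each held-out instance by how often it is predicted correctly, then drop up to k of the most predictable instances whose score reaches the threshold. All function names are hypothetical, `tau` stands for the early-stopping threshold, a nearest-centroid classifier is swapped in for the model family M to keep the example dependency-free, and capping removals so the set never shrinks below n is a choice made for this sketch.

```python
import random
from collections import defaultdict


def fit_centroids(feats, labels):
    """Hypothetical stand-in for a model from family M: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for f, label in zip(feats, labels):
        if label not in sums:
            sums[label] = [0.0] * len(f)
        sums[label] = [s + v for s, v in zip(sums[label], f)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_centroid(centroids, f):
    """Return the class whose centroid is closest to feature vector f."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(f, centroids[c]))
    return min(centroids, key=sq_dist)


def aflite_sketch(phi, y, n, m, t, k, tau, seed=0):
    """Return the indices kept in the reduced dataset S (assumed loop body)."""
    if not t < n:
        raise ValueError("training set size t must be smaller than n")
    rng = random.Random(seed)
    S = list(range(len(y)))
    while len(S) > n:
        correct = defaultdict(int)  # held-out predictions that matched the label
        seen = defaultdict(int)     # number of times each instance was held out
        for _ in range(m):
            shuffled = S[:]
            rng.shuffle(shuffled)
            train, held_out = shuffled[:t], shuffled[t:]
            model = fit_centroids([phi[i] for i in train], [y[i] for i in train])
            for i in held_out:
                seen[i] += 1
                correct[i] += int(predict_centroid(model, phi[i]) == y[i])
        # Predictability score: fraction of held-out predictions that were correct.
        score = {i: correct[i] / seen[i] for i in S if seen[i] > 0}
        candidates = sorted((i for i in score if score[i] >= tau),
                            key=lambda i: -score[i])
        # Sketch-specific assumption: never remove so many that |S| drops below n.
        removed = set(candidates[:min(k, len(S) - n)])
        S = [i for i in S if i not in removed]
        if len(removed) < k:  # early stop: too few predictable instances left
            break
    return S


# Toy usage: two well-separated classes, so most instances are highly predictable.
phi = [[i * 0.01] for i in range(10)] + [[10 + i * 0.01] for i in range(10)]
y = [0] * 10 + [1] * 10
kept = aflite_sketch(phi, y, n=10, m=5, t=6, k=4, tau=0.75)
print(len(kept), "instances kept")
```

With a threshold above 1 no instance can qualify, so the early-stopping branch fires on the first pass and the dataset is returned unchanged. In practice the representation `phi` would come from a pre-trained model and the classifier would be a proper linear model.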