Learning Stable Classifiers by Transferring Unstable Features

Authors: Yujia Bao, Shiyu Chang, Regina Barzilay

ICML 2022

Reproducibility checklist (variable, result, and the supporting LLM response quoted from the paper):

Research Type: Experimental
"We evaluate our method on both text and image classification. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task for both synthetically generated environments and real-world environments. Our code is available at https://github.com/YujiaBao/Tofu."

Researcher Affiliation: Academia
"Yujia Bao (1), Shiyu Chang (2), Regina Barzilay (1). (1) MIT CSAIL; (2) Computer Science, UC Santa Barbara. Correspondence to: Yujia Bao <yujia@csail.mit.edu>."

Pseudocode: No
"Our overall transfer paradigm is depicted in Figure 2. It consists of two steps: inferring unstable features from the source task (Section 3.1) and learning stable correlations for the target task (Section 3.2)."
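To make the two-step paradigm concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation (all function names are illustrative, and the objectives are simplified). Step 1 uses a common proxy: features that discriminate the two source training environments are, by construction, the unstable ones. Step 2 clusters target examples by those features and minimizes the worst cluster's loss so the target classifier cannot rely on them.

```python
# Hypothetical sketch of the two-step paradigm (illustrative names, simplified
# objectives; see the paper's Sections 3.1-3.2 for the actual algorithm).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

def learn_unstable_encoder(x_env1, x_env2, dim=32, steps=200):
    """Step 1 (proxy): train an encoder to discriminate the two source
    environments; the features it learns are the unstable ones."""
    x = torch.cat([x_env1, x_env2])
    env = torch.cat([torch.zeros(len(x_env1)), torch.ones(len(x_env2))]).long()
    enc = nn.Sequential(nn.Linear(x.shape[1], dim), nn.ReLU())
    head = nn.Linear(dim, 2)
    opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(head(enc(x)), env).backward()
        opt.step()
    return enc

def train_target(x, y, enc, n_clusters=2, steps=200):
    """Step 2: cluster target examples by their unstable features, then
    minimize the worst cluster's loss (group-DRO style). The paper clusters
    within each class; we cluster globally here for brevity."""
    with torch.no_grad():
        groups = torch.tensor(KMeans(n_clusters, n_init=10).fit_predict(enc(x).numpy()))
    clf = nn.Linear(x.shape[1], int(y.max()) + 1)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        losses = torch.stack([F.cross_entropy(clf(x[groups == g]), y[groups == g])
                              for g in groups.unique()])
        losses.max().backward()  # optimize the worst-performing cluster
        opt.step()
    return clf
```
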
Open Source Code: Yes
"Our code is available at https://github.com/YujiaBao/Tofu."

Open Datasets: Yes
"We consider four datasets: MNIST (LeCun et al., 1998), Beer Review (McAuley et al., 2012), ASK2ME (Bao et al., 2019a) and Waterbird (Sagawa et al., 2019). In MNIST and Beer Review, we inject a spurious feature into the input (background color for MNIST and a pseudo token for Beer Review). In ASK2ME and Waterbird, the spurious feature corresponds to an attribute of the input (breast cancer for ASK2ME and background for Waterbird). We study CelebA (Liu et al., 2015a), where each input (an image of a human face) is annotated with 40 binary attributes."
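As an illustration of the synthetic injection described above, the following hypothetical sketch colors the background of MNIST-style digits according to a binary label with probability p. The function name and the exact coloring mechanism are assumptions for illustration, not the paper's construction.

```python
# Hypothetical spurious-feature injection for MNIST-like inputs: the background
# color correlates with the binary label with probability p, so p can be varied
# across environments. Not the paper's exact construction.
import torch

def inject_color_background(images, labels, p):
    """images: (N, 28, 28) grayscale in [0, 1]; labels: (N,) in {0, 1}.
    Returns (N, 3, 28, 28) with a red or green background tied to the label."""
    n = len(images)
    rgb = images.unsqueeze(1).repeat(1, 3, 1, 1)                   # grayscale -> RGB
    spurious = torch.where(torch.rand(n) < p, labels, 1 - labels)  # flip w.p. 1 - p
    background = images < 0.1                                      # near-black pixels
    for i in range(n):
        channel = 0 if spurious[i] == 0 else 1                     # red vs. green
        rgb[i, channel][background[i]] = 1.0
    return rgb
```
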
Dataset Splits: Yes
"For each dataset, we consider multiple tasks and study the transfer between these tasks. Specifically, for each task, we split its data into four environments: E_train^1, E_train^2, E_val, E_test, where spurious correlations vary across the two training environments E_train^1 and E_train^2. [...] Training environments are constructed from the training split, with 7370 examples per environment for EVEN and 7625 examples per environment for ODD. Validation and testing data are constructed from the testing split. For EVEN, both validation and testing data have 1230 examples; for ODD, the number is 1267."
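Continuing the hypothetical helper above, the four environments could be instantiated by varying the correlation strength p. The split sizes match the EVEN task quoted above; the p values and the x_train/y_train variables are illustrative assumptions, not the paper's.

```python
# Illustrative environment construction (p values and data variables are
# assumptions; sizes match the EVEN task quoted above). Reuses the
# inject_color_background sketch from the previous block.
e_train1 = inject_color_background(x_train[:7370],      y_train[:7370],      p=0.9)  # E_train^1
e_train2 = inject_color_background(x_train[7370:14740], y_train[7370:14740], p=0.7)  # E_train^2
e_val    = inject_color_background(x_test[:1230],  y_test[:1230],  p=0.5)  # E_val: correlation removed
e_test   = inject_color_background(x_test[1230:2460], y_test[1230:2460], p=0.1)  # E_test: reversed
```
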
Hardware Specification: Yes
"We use our internal clusters (24 NVIDIA RTX A6000 and 16 Tesla V100-PCIE-32GB) for the experiments."

Software Dependencies: No
"We use Adam (Kingma & Ba, 2014) to optimize the parameters and tune the learning rate over {10^-3, 10^-4}. For simplicity, we train all methods without data augmentation. Following Sagawa et al. (2019), we apply strong regularization to avoid over-fitting. Specifically, we tune the dropout rate over {0.1, 0.3, 0.5} for the text classification datasets (Beer Review and ASK2ME) and the weight decay over {10^-1, 10^-2, 10^-3} for the image datasets (MNIST, Waterbird and CelebA)."

Experiment Setup: Yes
"We use batch size 50 and evaluate the validation performance every 100 batches. We apply early stopping once the validation performance hasn't improved in the past 20 evaluations. We use Adam (Kingma & Ba, 2014) to optimize the parameters and tune the learning rate over {10^-3, 10^-4}. For simplicity, we train all methods without data augmentation. Following Sagawa et al. (2019), we apply strong regularization to avoid over-fitting. Specifically, we tune the dropout rate over {0.1, 0.3, 0.5} for the text classification datasets (Beer Review and ASK2ME) and the weight decay over {10^-1, 10^-2, 10^-3} for the image datasets (MNIST, Waterbird and CelebA)."
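For reference, the quoted protocol corresponds to a training loop along these lines. This is a hedged sketch: model, train_loader, val_loader, and evaluate are assumed to exist; only the hyperparameter values noted in the comments come from the paper.

```python
# Sketch of the quoted training protocol. `model`, `train_loader` (batch size
# 50), `val_loader`, and `evaluate` are assumed; commented values are the paper's.
import itertools
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr tuned over {1e-3, 1e-4}
best_val, stale = 0.0, 0
for step, (x, y) in enumerate(itertools.cycle(train_loader), start=1):
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()
    if step % 100 == 0:                              # evaluate every 100 batches
        val_acc = evaluate(model, val_loader)
        if val_acc > best_val:
            best_val, stale = val_acc, 0
        else:
            stale += 1
        if stale >= 20:                              # early-stopping patience
            break
```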