Model Patching: Closing the Subgroup Performance Gap with Data Augmentation

Authors: Karan Goel, Albert Gu, Yixuan Li, Christopher Ré

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive empirical study (Section 4) that validates CycleGAN Augmented Model Patching's (CAMEL) ability to improve subgroup invariance and robustness. We first evaluate CAMEL on a controlled MNIST setup, where it cuts robust error rate to a third of other approaches while learning representations that are far more invariant, as measured by mutual information estimates. On two machine learning benchmarks, CelebA and Waterbirds, CAMEL consistently outperforms state-of-the-art approaches that rely on robust optimization, with reductions in subgroup performance gap by up to 10%. Next, we perform ablations on each stage of our framework.
Researcher Affiliation | Academia | Karan Goel, Albert Gu, Sharon Li, Christopher Ré. Department of Computer Science, Stanford University {kgoel,albertgu,sharonli,chrismre}@cs.stanford.edu
Pseudocode | No | The paper describes algorithms and components in prose and with mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code for reproducing our results is available on GitHub.
Open Datasets | Yes | We mix data from MNIST (LeCun et al., 1998) and MNIST-Corrupted (Mu & Gilmer, 2019) to create a controlled setup. Following Sagawa et al. (2020), we classify hair color Y ∈ {non-blonde, blonde} in the CelebA dataset (Liu et al., 2015). In the Waterbirds dataset, constructed to analyze spurious correlations (Sagawa et al., 2020), birds Y ∈ {landbird, waterbird} are placed against image backgrounds Z ∈ {land, water}. In the skin cancer dataset of Codella et al. (2018), we classify Y ∈ {benign, malignant} cancers.
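The MNIST / MNIST-Corrupted mixing quoted above can be sketched as follows. This is a hedged illustration, not the authors' code: `make_mixed_subgroups`, its arguments, and the 50/50 mixing fraction are our own assumptions; the idea is only that each example ends up with a subgroup label Z indicating which source it came from.

```python
import numpy as np

def make_mixed_subgroups(clean_images, corrupted_images, frac_corrupted=0.5, seed=0):
    """Sketch: mix clean and corrupted copies of the same examples so that
    each example carries a subgroup label Z in {0: clean, 1: corrupted}.
    (Function and argument names are ours, not from the paper.)"""
    rng = np.random.default_rng(seed)
    n = len(clean_images)
    # For each example, flip a (biased) coin to decide which version to keep.
    z = rng.random(n) < frac_corrupted
    # Broadcast the per-example choice over the image dimensions.
    images = np.where(z[:, None, None], corrupted_images, clean_images)
    return images, z.astype(int)
```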
Dataset Splits | Yes | For validation, we use 50% of the training data. Table 8: Number of training, validation and test examples in each dataset.
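The 50% validation split described above is a plain random hold-out; a minimal sketch (the function name and seed handling are ours, not from the paper):

```python
import numpy as np

def split_train_val(n_examples, val_fraction=0.5, seed=0):
    """Randomly hold out a fraction of the training indices for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_val = int(n_examples * val_fraction)
    # First n_val shuffled indices go to validation, the rest to training.
    return idx[n_val:], idx[:n_val]
```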
Hardware Specification | No | The paper mentions training 'All classifiers are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights' and that 'All training code is written in Python with tensorflow-2.0,' but it does not provide specific details about the CPU, GPU models, memory, or any other hardware specifications used for the experiments.
Software Dependencies | Yes | All training code is written in Python with tensorflow-2.0. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training.
Experiment Setup | Yes | All models are trained with Stochastic Gradient Descent (SGD), with a momentum of 0.9. All models are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights. The only preprocessing common to all methods is standard ImageNet normalization using µ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training, with batchnorm for layer normalization. We use Adam for optimization (β1 = 0.5) with a constant learning rate of 0.0002 for both generators and both discriminators. Section D.4 and Table 12 provide extensive details on hyperparameter sweeps and selected values for learning rates, weight decays, adjustment coefficients, and consistency penalties.
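The ImageNet normalization quoted above is a per-channel standardization with the stated µ and σ; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

# ImageNet channel statistics quoted in the paper's setup (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Standard ImageNet normalization for an HxWx3 image scaled to [0, 1]."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```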