Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Model Patching: Closing the Subgroup Performance Gap with Data Augmentation
Authors: Karan Goel, Albert Gu, Yixuan Li, Christopher Ré
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive empirical study (Section 4) that validates CycleGAN Augmented Model Patching (CAMEL)'s ability to improve subgroup invariance and robustness. We first evaluate CAMEL on a controlled MNIST setup, where it cuts robust error rate to a third of other approaches while learning representations that are far more invariant, as measured by mutual information estimates. On two machine learning benchmarks, CelebA and Waterbirds, CAMEL consistently outperforms state-of-the-art approaches that rely on robust optimization, with reductions in subgroup performance gap by up to 10%. Next, we perform ablations on each stage of our framework. |
| Researcher Affiliation | Academia | Karan Goel, Albert Gu, Sharon Li, Christopher Ré. Department of Computer Science, Stanford University. EMAIL |
| Pseudocode | No | The paper describes algorithms and components in prose and with mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code for reproducing our results is available on GitHub. |
| Open Datasets | Yes | We mix data from MNIST (LeCun et al., 1998) and MNIST-Corrupted (Mu & Gilmer, 2019) to create a controlled setup. Following Sagawa et al. (2020), we classify hair color Y ∈ {non-blonde, blonde} in the CelebA dataset (Liu et al., 2015). In this dataset to analyze spurious correlations (Sagawa et al., 2020), birds Y ∈ {landbird, waterbird} are placed against image backgrounds Z ∈ {land, water}. In this skin cancer dataset (Codella et al., 2018), we classify Y ∈ {benign, malignant} cancers. |
| Dataset Splits | Yes | For validation, we use 50% of the training data. Table 8: Number of training, validation and test examples in each dataset. |
| Hardware Specification | No | The paper mentions training 'All classifiers are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights' and that 'All training code is written in Python with tensorflow-2.0,' but it does not provide specific details about the CPU, GPU models, memory, or any other hardware specifications used for the experiments. |
| Software Dependencies | Yes | All training code is written in Python with tensorflow-2.0. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training. |
| Experiment Setup | Yes | All models are trained with Stochastic Gradient Descent (SGD), with a momentum of 0.9. All models are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights. The only preprocessing common to all methods is standard ImageNet normalization using µ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training, with batchnorm for layer normalization. We use Adam for optimization (β1 = 0.5) with a constant learning rate of 0.0002 for both generators and both discriminators. Section D.4 and Table 12 provide extensive details on hyperparameter sweeps and selected values for learning rates, weight decays, adjustment coefficients, and consistency penalties. |
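The only preprocessing the paper reports as common to all methods is standard ImageNet normalization with the µ and σ constants quoted above. A minimal NumPy sketch of that step (the function name `normalize` is illustrative, not taken from the authors' released code):

```python
import numpy as np

# Channel-wise ImageNet normalization constants quoted in the paper.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image: np.ndarray) -> np.ndarray:
    """Standard ImageNet normalization: (x - mean) / std per RGB channel.

    `image` is an H x W x 3 float array with values already scaled to [0, 1].
    Broadcasting applies the per-channel constants across all pixels.
    """
    return (image - IMAGENET_MEAN) / IMAGENET_STD

# Example: a mid-gray image (all channels 0.5) maps to small values near zero.
gray = np.full((2, 2, 3), 0.5)
out = normalize(gray)
```

The same transform is what `torchvision.transforms.Normalize` or a TensorFlow preprocessing layer would apply with these constants; it centers inputs to match the statistics the pretrained ResNet-50 weights expect.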