Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Model Patching: Closing the Subgroup Performance Gap with Data Augmentation
Authors: Karan Goel, Albert Gu, Yixuan Li, Christopher Ré
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive empirical study (Section 4) that validates CycleGAN Augmented Model Patching (CAMEL)'s ability to improve subgroup invariance and robustness. We first evaluate CAMEL on a controlled MNIST setup, where it cuts robust error rate to a third of other approaches while learning representations that are far more invariant, as measured by mutual information estimates. On two machine learning benchmarks, CelebA and Waterbirds, CAMEL consistently outperforms state-of-the-art approaches that rely on robust optimization, with reductions in subgroup performance gap by up to 10%. Next, we perform ablations on each stage of our framework. |
| Researcher Affiliation | Academia | Karan Goel, Albert Gu, Sharon Li, Christopher Ré. Department of Computer Science, Stanford University. EMAIL |
| Pseudocode | No | The paper describes algorithms and components in prose and with mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code for reproducing our results is available on GitHub. |
| Open Datasets | Yes | We mix data from MNIST (LeCun et al., 1998) and MNIST-Corrupted (Mu & Gilmer, 2019) to create a controlled setup. Following Sagawa et al. (2020), we classify hair color Y ∈ {non-blonde, blonde} in the CelebA dataset (Liu et al., 2015). In this dataset to analyze spurious correlations (Sagawa et al., 2020), birds Y ∈ {landbird, waterbird} are placed against image backgrounds Z ∈ {land, water}. In this skin cancer dataset (Codella et al., 2018), we classify Y ∈ {benign, malignant} cancers. |
| Dataset Splits | Yes | For validation, we use 50% of the training data. Table 8: Number of training, validation and test examples in each dataset. |
| Hardware Specification | No | The paper mentions training 'All classifiers are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights' and that 'All training code is written in Python with tensorflow-2.0,' but it does not provide specific details about the CPU, GPU models, memory, or any other hardware specifications used for the experiments. |
| Software Dependencies | Yes | All training code is written in Python with tensorflow-2.0. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training. |
| Experiment Setup | Yes | All models are trained with Stochastic Gradient Descent (SGD), with a momentum of 0.9. All models are fine-tuned using a ResNet-50 architecture, with pretrained ImageNet weights. The only preprocessing common to all methods is standard ImageNet normalization using µ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]. We use the default hyperparameters suggested by Zhu et al. (2017) for CycleGAN training, with batchnorm for layer normalization. We use Adam for optimization (β1 = 0.5) with a constant learning rate of 0.0002 for both generators and both discriminators. Section D.4 and Table 12 provide extensive details on hyperparameter sweeps and selected values for learning rates, weight decays, adjustment coefficients, and consistency penalties. |
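The only preprocessing the paper reports as common to all methods is standard ImageNet normalization with the µ and σ constants quoted above. A minimal NumPy sketch of that step (the function name `normalize` is illustrative, not taken from the authors' released code):

```python
import numpy as np

# Channel-wise ImageNet normalization constants quoted in the paper.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image: np.ndarray) -> np.ndarray:
    """Standard ImageNet normalization: (x - mean) / std per RGB channel.

    `image` is an H x W x 3 float array with values already scaled to [0, 1].
    Broadcasting applies the per-channel constants across all pixels.
    """
    return (image - IMAGENET_MEAN) / IMAGENET_STD

# Example: a mid-gray image (all channels 0.5) maps to small values near zero.
gray = np.full((2, 2, 3), 0.5)
out = normalize(gray)
```

The same transform is what `torchvision.transforms.Normalize` or a TensorFlow preprocessing layer would apply with these constants; it centers inputs to match the statistics the pretrained ResNet-50 weights expect.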