On the Foundations of Shortcut Learning
Authors: Katherine Hermann, Hossein Mobahi, Thomas Fel, Michael Curtis Mozer
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we test hypotheses about which input properties are more available to a model, and systematically study how predictivity and availability interact to shape models' feature use. Our empirical findings are consistent with a theoretical account based on Neural Tangent Kernels. We perform parametric studies of latent-feature predictivity and availability, and examine the sensitivity of different model architectures to shortcut bias, finding that it is greater for nonlinear models than linear models, and that model depth amplifies bias. |
| Researcher Affiliation | Industry | Google DeepMind and Google Research, Mountain View, CA, USA; {hermannk, hmobahi, thomasfel, mcmozer}@google.com |
| Pseudocode | No | The paper describes experimental procedures and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to an open-source code repository for the methodology described. While Python code snippets are present in the supplementary material, they are not presented as the released source code for the full methodology. |
| Open Datasets | Yes | We construct two datasets by sampling images from Waterbirds (Sagawa et al., 2020a) (core: Bird, non-core: Background), and CelebA (Liu et al., 2015) (core: Attractive, non-core: Smiling). Waterbirds images (Sagawa et al., 2020a) combine birds taken from the Caltech-UCSD Birds-200-2011 dataset (Wah et al., 2011) and backgrounds from the Places dataset (Zhou et al., 2017). The CelebA (Liu et al., 2015) dataset contains images of celebrity faces paired with a variety of binary attribute labels. |
| Dataset Splits | Yes | We sample class-balanced datasets with 3200 train instances, 1000 validation instances, and 900 probe (evaluation) instances that uniformly cover the (zs, zc) space by taking a Cartesian product of 30 zs evenly spaced in [−3µs, +3µs] and 30 zc evenly spaced in [−3µc, +3µc]. To construct the datasets used in Figure 6A, we sample images from a base Waterbirds dataset generated using code by Sagawa et al. (2020a) (github.com/kohpangwei/group_DRO) with val_frac = 0.2 and confounder_strength = 0.5, yielding sets of 5694 train images (224×224), 300 validation images, and 5794 test images. We then subsample these sets to 1200 train, 90 validation (×2 sets), and 1000 probe images (×2 sets), respectively, such that the train and validation sets instantiate target feature predictivities... (A minimal probe-grid sketch appears below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using "Sklearn" for Linear Discriminant Analysis and "numpy" and "matplotlib" in code snippets, but it does not specify version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Unless otherwise described, we train a multilayer perceptron (MLP, depth = 8, width = 128, hidden activation function = ReLU) with MSE loss for 100 epochs using an SGD optimizer with batch size = 64 and learning rate = 1e-02. We use Glorot Normal weight initialization. In these experiments, we train a randomly initialized ResNet18 architecture with MSE loss using an Adam optimizer with batch size = 64 and learning rate = 1e-02, taking the best (defined on validation accuracy) across 100 epochs of training. In the experiments shown in Figures 6C and B.9, we preprocess images by normalizing to be in [−1, 1], apply random crops of size 200×200, resize to 224×224, and randomly flip over the horizontal axis with p = 0.5. We train models with MSE loss using an Adam optimizer with batch size = 32, cosine decay learning rate schedule (linear warmup, initial learning rate = 1e-03), and weight decay = 1e-05. For the ResNet18 experiments in Figures 6C and B.9, we train a randomly initialized ResNet18 for 30 epochs. For the ResNet50 experiments in Figure B.9, we train the randomly initialized readout layer of a frozen, ImageNet-pretrained ResNet50 (IN-ResNet50) for 15 epochs. (Minimal configuration sketches appear below the table.) |
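
The following is a minimal sketch, not the authors' released code, of the 30 × 30 probe grid quoted in the Dataset Splits row. The parameters `mu_s` and `mu_c` are placeholders; the paper's exact latent-feature parameters are not restated here.

```python
import numpy as np

# Placeholder scale parameters for the shortcut (zs) and core (zc) latent
# features; substitute the values used in the paper.
mu_s, mu_c = 1.0, 1.0

# Cartesian product of 30 evenly spaced zs values in [-3*mu_s, +3*mu_s]
# and 30 evenly spaced zc values in [-3*mu_c, +3*mu_c], giving the 900
# probe (evaluation) points that uniformly cover the (zs, zc) space.
zs_grid = np.linspace(-3 * mu_s, 3 * mu_s, 30)
zc_grid = np.linspace(-3 * mu_c, 3 * mu_c, 30)
zs_mesh, zc_mesh = np.meshgrid(zs_grid, zc_grid, indexing="ij")
probe_points = np.stack([zs_mesh.ravel(), zc_mesh.ravel()], axis=1)

assert probe_points.shape == (900, 2)
```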
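
The default MLP configuration quoted in the Experiment Setup row can be summarized as below. This is a sketch under assumptions: the paper does not name its framework, so PyTorch is used for illustration, and `input_dim` is a placeholder.

```python
import torch
import torch.nn as nn

# Assumed placeholder input dimensionality; the quoted configuration only
# fixes depth = 8, width = 128, and ReLU hidden activations.
input_dim = 2

# Build an 8-hidden-layer MLP of width 128 with ReLU activations and a
# scalar output suitable for MSE loss.
layers, in_features = [], input_dim
for _ in range(8):
    layers += [nn.Linear(in_features, 128), nn.ReLU()]
    in_features = 128
layers.append(nn.Linear(in_features, 1))
model = nn.Sequential(*layers)

# Glorot (Xavier) Normal weight initialization, as stated in the quote.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# Training would then iterate over mini-batches of size 64 for 100 epochs.
```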
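
A similarly hedged sketch of the Waterbirds preprocessing described for Figures 6C and B.9, assuming torchvision; placing the [−1, 1] normalization after the geometric augmentations is an assumption, since the quoted text does not fix the order of operations.

```python
from torchvision import transforms

# Random 200x200 crop, resize to 224x224, horizontal flip with p = 0.5,
# then map pixel values to [-1, 1] (ToTensor scales to [0, 1]; Normalize
# with mean 0.5 and std 0.5 maps [0, 1] to [-1, 1]).
preprocess = transforms.Compose([
    transforms.RandomCrop(200),
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```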