Understanding the failure modes of out-of-distribution generalization
Authors: Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature, and another, statistical in nature. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. We also design experiments to isolate the two failure modes when training modern neural networks on these datasets. |
| Researcher Affiliation | Collaboration | Vaishnavh Nagarajan, Carnegie Mellon University (vaishnavh@cs.cmu.edu); Anders Andreassen, Blueshift, Alphabet (ajandreassen@google.com); Behnam Neyshabur, Blueshift, Alphabet (neyshabur@google.com) |
| Pseudocode | No | The paper describes mathematical formulations and experimental procedures in prose and through equations and figures. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/google-research/OOD-failures |
| Open Datasets | Yes | Specifically, consider the following Binary-MNIST based task... (We present similar results for a CIFAR10 setting, and all experiment details in App C.1.) ...the cats vs. dogs (Elson et al., 2007) ...obesity estimation task based on the dataset from Palechor & de la Hoz Manotas (2019). |
| Dataset Splits | No | The paper mentions training, testing, and sometimes a combined test/validation set (e.g., 'use the remaining 5262 datapoints for testing/validation'). However, it does not specify explicit train/validation/test splits, with percentages or sizes for a distinct validation set, which would be needed for reproduction. |
| Hardware Specification | No | The paper mentions training different models (e.g., 'fully-connected three-layered ReLU network', 'ResNet V1'). However, it does not specify any hardware details such as CPU/GPU models, memory, or other computational resources used for the experiments. |
| Software Dependencies | No | The paper mentions using optimizers like SGD and Adam, and references a Keras example implementation ('Borrowing the implementation in https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py'). However, it does not provide specific version numbers for Keras, TensorFlow, PyTorch, Python, or any other software libraries, which are necessary for full reproducibility. |
| Experiment Setup | Yes | In all our MNIST-based experiments, we consider the Binary-MNIST classification task... for this we train a fully-connected three-layered ReLU network with a width of 400 and using SGD with learning rate 0.1 for 50 epochs. In all our CIFAR10-based experiments... we train a ResNet V1 with a depth of 20 for 200 epochs. ...we train a linear model with no bias on the logistic loss with a learning rate of 0.001, batch size of 32 and training set size of 2048. ...train a linear classifier to minimize the logistic loss using SGD with a learning rate of 0.01 and batch size of 32 for as many as 10k epochs. |
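To make the reported setup concrete, here is a minimal Keras sketch of the Binary-MNIST experiment as described above: a fully-connected three-layered ReLU network of width 400, trained with SGD (learning rate 0.1) for 50 epochs. The 0-4 vs. 5-9 label grouping, the reading of 'three-layered' as three weight layers (two ReLU hidden layers plus an output layer), and the omission of the paper's spurious-feature construction are all assumptions; the authors' exact pipeline is in the released code at https://github.com/google-research/OOD-failures.

```python
# Hedged sketch of the Binary-MNIST setup; see the lead-in above for
# the assumptions made here.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = (y_train >= 5).astype("float32")  # assumed 0-4 vs 5-9 grouping
y_test = (y_test >= 5).astype("float32")

# "Three-layered" read as three weight layers: two ReLU hidden layers
# of width 400 plus a sigmoid output (an assumption).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(400, activation="relu"),
    tf.keras.layers.Dense(400, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test))
```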
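The linear-classifier experiments can be sketched similarly: a bias-free linear model trained to minimize the logistic loss with SGD (learning rate 0.01, batch size 32, up to 10k epochs, training set size 2048). The synthetic Gaussian data below is a placeholder assumption; the paper constructs its own easy-to-learn tasks with controlled spurious correlations.

```python
# Hedged sketch of the bias-free linear classifier trained on the
# logistic loss. The data-generating process here is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 10  # n matches the paper's training set size; d is assumed
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)  # labels in {-1, +1}; assumed task

w = np.zeros(d)  # no bias term, per the paper
lr, batch, epochs = 0.01, 32, 10_000  # reduce epochs for a quick run
for _ in range(epochs):
    perm = rng.permutation(n)
    for i in range(0, n, batch):
        idx = perm[i:i + batch]
        margins = y[idx] * (X[idx] @ w)
        # gradient of the mean logistic loss log(1 + exp(-y * w.x))
        coef = -y[idx] / (1.0 + np.exp(margins))
        w -= lr * (coef[:, None] * X[idx]).mean(axis=0)
```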