Understanding the failure modes of out-of-distribution generalization
Authors: Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature, and another, statistical in nature. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. We also design experiments to isolate the two failure modes when training modern neural networks on these datasets. |
| Researcher Affiliation | Collaboration | Vaishnavh Nagarajan, Carnegie Mellon University (vaishnavh@cs.cmu.edu); Anders Andreassen, Blueshift, Alphabet (ajandreassen@google.com); Behnam Neyshabur, Blueshift, Alphabet (neyshabur@google.com) |
| Pseudocode | No | The paper describes mathematical formulations and experimental procedures in prose and through equations and figures. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/google-research/OOD-failures |
| Open Datasets | Yes | Specifically, consider the following Binary-MNIST based task... (We present similar results for a CIFAR10 setting, and all experiment details in App C.1.) ...the cats vs. dogs (Elson et al., 2007) ...obesity estimation task based on the dataset from Palechor & de la Hoz Manotas (2019). |
| Dataset Splits | No | The paper mentions training, testing, and sometimes a combined test/validation set (e.g., 'use the remaining 5262 datapoints for testing/validation'). However, it does not specify explicit train/validation/test splits, with percentages or sizes for a distinct validation set, which would be needed for reproduction. |
| Hardware Specification | No | The paper mentions training different models (e.g., 'fully-connected three-layered ReLU network', 'ResNet V1'). However, it does not specify any hardware details such as CPU/GPU models, memory, or other computational resources used for the experiments. |
| Software Dependencies | No | The paper mentions using optimizers like SGD and Adam, and references a Keras example implementation ('Borrowing the implementation in https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py'). However, it does not provide specific version numbers for Keras, TensorFlow, PyTorch, Python, or any other software libraries, which are necessary for full reproducibility. |
| Experiment Setup | Yes | In all our MNIST-based experiments, we consider the Binary-MNIST classification task... for this we train a fully-connected three-layered ReLU network with a width of 400 and using SGD with learning rate 0.1 for 50 epochs. In all our CIFAR10-based experiments... we train a ResNet V1 with a depth of 20 for 200 epochs. ...we train a linear model with no bias on the logistic loss with a learning rate of 0.001, batch size of 32 and training set size of 2048. ...train a linear classifier to minimize the logistic loss using SGD with a learning rate of 0.01 and batch size of 32 for as many as 10k epochs. |
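To make the reported setup concrete, here is a minimal Keras sketch of the Binary-MNIST experiment as described above: a fully-connected three-layered ReLU network of width 400, trained with SGD (learning rate 0.1) for 50 epochs. The 0-4 vs. 5-9 label grouping, the reading of 'three-layered' as three weight layers (two ReLU hidden layers plus an output layer), and the omission of the paper's spurious-feature construction are all assumptions; the authors' exact pipeline is in the released code at https://github.com/google-research/OOD-failures.

```python
# Hedged sketch of the Binary-MNIST setup; see the lead-in above for
# the assumptions made here.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = (y_train >= 5).astype("float32")  # assumed 0-4 vs 5-9 grouping
y_test = (y_test >= 5).astype("float32")

# "Three-layered" read as three weight layers: two ReLU hidden layers
# of width 400 plus a sigmoid output (an assumption).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(400, activation="relu"),
    tf.keras.layers.Dense(400, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test))
```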
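The linear-classifier experiments can be sketched similarly: a bias-free linear model trained to minimize the logistic loss with SGD (learning rate 0.01, batch size 32, up to 10k epochs, training set size 2048). The synthetic Gaussian data below is a placeholder assumption; the paper constructs its own easy-to-learn tasks with controlled spurious correlations.

```python
# Hedged sketch of the bias-free linear classifier trained on the
# logistic loss. The data-generating process here is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 10  # n matches the paper's training set size; d is assumed
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)  # labels in {-1, +1}; assumed task

w = np.zeros(d)  # no bias term, per the paper
lr, batch, epochs = 0.01, 32, 10_000  # reduce epochs for a quick run
for _ in range(epochs):
    perm = rng.permutation(n)
    for i in range(0, n, batch):
        idx = perm[i:i + batch]
        margins = y[idx] * (X[idx] @ w)
        # gradient of the mean logistic loss log(1 + exp(-y * w.x))
        coef = -y[idx] / (1.0 + np.exp(margins))
        w -= lr * (coef[:, None] * X[idx]).mean(axis=0)
```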