Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization
Authors: Dang Nguyen, Paymon Haddad, Eric Gan, Baharan Mirzasoleiman
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we take the first steps towards addressing the above problem. To do so, we rely on recent results in non-convex optimization, showing the superior generalization performance of sharpness-aware-minimization (SAM) [22] over (stochastic) gradient descent (GD). ... To address the above question, we first theoretically analyze the dynamics of training a two-layer convolutional neural network (CNN) with SAM and compare it with that of GD. We rigorously prove that SAM learns different features in a more uniform speed compared to GD, particularly early in training. ... We show the effectiveness of USEFUL in alleviating the simplicity bias and improving the generalization via extensive experiments. *(a minimal SAM update sketch follows the table)* |
| Researcher Affiliation | Academia | Tuan Hai Dang Nguyen, Paymon Haddad, Eric Gan, Baharan Mirzasoleiman, Department of Computer Science, UCLA |
| Pseudocode | Yes | Algorithm 1 Up Sample Early For Uniform Learning (USEFUL) *(an illustrative sketch follows the table)* |
| Open Source Code | Yes | We also provide our code for reproducing the experimental results. |
| Open Datasets | Yes | We used common datasets for image classification including CIFAR10, CIFAR100 [41], STL10 [13], CINIC10 [16], and Tiny-ImageNet [43]. |
| Dataset Splits | Yes | The CIFAR10 dataset [41] consists of 60,000 32×32 color images in 10 classes, with 6000 images per class. The CIFAR100 dataset [41] is just like the CIFAR10, except it has 100 classes containing 600 images each. For both of these datasets, the training set has 50,000 images (5,000 per class for CIFAR10 and 500 per class for CIFAR100) with the test set having 10,000 images. ... The dataset consists of 500 training images, 50 validation images, and 50 test images per class. |
| Hardware Specification | Yes | Each model is trained on 1 NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions using 'official Pytorch [54] implementation' but does not specify a version number for PyTorch or other software dependencies. |
| Experiment Setup | Yes | We trained our models for 200 epochs with a batch size of 128 and used basic data augmentations such as random mirroring and random crop. We used SGD with the momentum parameter of 0.9 and set weight decay to 0.0005. We also fixed ρ = 0.1 for SAM unless further specified. For all datasets, we used a learning rate schedule where we set the initial learning rate to 0.1. The learning rate is decayed by a factor of 10 after 50% and 75% epochs, i.e., we set the learning rate to 0.01 after 100 epochs and to 0.001 after 150 epochs. *(a hedged training-loop sketch follows the table)* |
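
The Research Type row above centers on sharpness-aware minimization (SAM). For orientation only, below is a minimal PyTorch sketch of a single SAM update, assuming the standard two-pass formulation (ascend to the worst-case weights within an L2 ball of radius ρ, then step with the base optimizer at the perturbed point). The helper name `sam_step` and its signature are ours, not the authors' code; ρ = 0.1 matches the value quoted in the Experiment Setup row.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.1):
    """One sharpness-aware-minimization (SAM) update (sketch, not the paper's code):
    perturb the weights toward the locally worst-case direction, then apply the
    base optimizer's step using the gradient at the perturbed point."""
    base_optimizer.zero_grad()

    # 1) gradient at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) perturb: w <- w + rho * g / ||g||  (ascent within an L2 ball of radius rho)
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # 3) gradient at the perturbed weights
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation, then take the descent step with the base optimizer
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

With the quoted setup, the base optimizer would be SGD with momentum 0.9 and weight decay 0.0005.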
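
The Pseudocode row quotes only the name of Algorithm 1, USEFUL (Up Sample Early For Uniform Learning). The sketch below illustrates the general "upsample the slowly learned examples early" idea only and is not a reproduction of the authors' Algorithm 1: the per-class median-loss split and the helper name `upsample_slowly_learned` are our own simplifying assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def upsample_slowly_learned(model, dataset, num_classes):
    """Illustrative early-upsampling routine (NOT the authors' Algorithm 1).
    After a short warm-up training phase, score every example by its loss,
    split each class at its median loss, and duplicate the higher-loss
    (more slowly learned) half of the class once."""
    model.eval()
    loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=False)
    losses, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            losses.append(F.cross_entropy(model(x), y, reduction="none").cpu())
            labels.append(y.cpu())
    losses = torch.cat(losses).numpy()
    labels = torch.cat(labels).numpy()

    duplicated = []
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        # examples above the class median loss are treated as slowly learned
        slow = idx[losses[idx] > np.median(losses[idx])]
        duplicated.extend(slow.tolist())

    # original dataset plus one extra copy of the slowly learned examples
    return torch.utils.data.ConcatDataset(
        [dataset, torch.utils.data.Subset(dataset, duplicated)]
    )
```

In this hypothetical flow, training would then restart on the returned, upsampled dataset.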
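
Finally, a hedged PyTorch sketch of the quoted Experiment Setup, using CIFAR10 as described in the Dataset Splits row. The architecture (a torchvision ResNet-18) and the crop padding of 4 are our own assumptions, since the excerpt does not state them; epochs, batch size, optimizer, weight decay, augmentations, and the step schedule mirror the quoted values.

```python
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torchvision.models import resnet18

EPOCHS, BATCH_SIZE = 200, 128  # quoted: 200 epochs, batch size 128

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # "random crop"; padding of 4 is our assumption
    transforms.RandomHorizontalFlip(),     # "random mirroring"
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4
)

model = resnet18(num_classes=10)  # architecture is an assumption; the excerpt does not name it
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# lr 0.1 -> 0.01 after 100 epochs (50%) -> 0.001 after 150 epochs (75%)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(EPOCHS):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```

For the SAM runs with ρ = 0.1, the inner `optimizer` step would be replaced by a two-pass update such as the `sam_step` sketch above, with this SGD instance as the base optimizer.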