Combining Diverse Feature Priors
Authors: Saachi Jain, Dimitris Tsipras, Aleksander Madry
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of feature priors have less overlapping failure modes, and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other's mistakes, which, in turn, leads to better generalization and resilience to spurious correlations. |
| Researcher Affiliation | Academia | Saachi Jain *1, Dimitris Tsipras *1, Aleksander Madry 1. 1MIT. Correspondence to: Saachi Jain <saachij@mit.edu>, Dimitris Tsipras <tsipras@mit.edu>. |
| Pseudocode | Yes | Algorithm 1 Self-Training... Algorithm 2 Standard Co-Training (a hedged sketch of both loops appears after the table) |
| Open Source Code | Yes | Code available at https://github.com/MadryLab/copriors. |
| Open Datasets | Yes | We train models on a small subset (100 examples per class) of the CIFAR-10 (Krizhevsky, 2009) and STL-10 (Coates et al., 2011) datasets... We also create two datasets that each contain a different spurious correlation. Tinted STL-10... Biased CelebA (Liu et al., 2015). |
| Dataset Splits | Yes | Specifically, we treat a small fraction of the training set as labeled examples (100 examples per class), another fraction as our validation set for tuning hyperparameters (10% of the total training examples), and the rest as unlabeled data. (See the index-splitting sketch after the table.) |
| Hardware Specification | Yes | All our experiments are performed using our internal cluster which mainly consists of NVIDIA 1080 Ti GTX GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, specific deep learning framework like PyTorch/TensorFlow with versions). |
| Experiment Setup | Yes | We train all our models using stochastic gradient descent (SGD) with momentum (a coefficient of 0.9) and a decaying learning rate. We add weight decay regularization with a coefficient of 10^-4. In terms of data augmentation, we apply random cropping with a padding of 4 pixels, random horizontal flips, and a random rotation of 2 degrees. ... We train all models with a batch size of 64 for 96×96-sized images and 128 for 32×32-sized images for a total of 300 epochs. ... The parameters chosen are shown in Table 11. (A hedged PyTorch sketch of this setup appears after the table.) |
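The paper presents Algorithm 1 (Self-Training) and Algorithm 2 (Standard Co-Training) as pseudocode. The sketch below is a hedged, framework-agnostic reading of the standard forms of both loops, not the authors' implementation; the `train_model`/`train_a`/`train_b` callables, the `predict_probs` interface, the number of rounds, and the confidence threshold are illustrative assumptions.

```python
# Hedged sketch of standard self-training and co-training loops.
# `train_model`, `predict_probs`, `rounds`, and `conf_threshold` are
# illustrative placeholders, not the authors' exact implementation.

import numpy as np

def self_train(train_model, labeled_x, labeled_y, unlabeled_x,
               rounds=5, conf_threshold=0.9):
    """Algorithm 1 (Self-Training), standard form: iteratively
    pseudo-label confident unlabeled points and retrain."""
    x, y, pool = labeled_x, labeled_y, unlabeled_x
    model = train_model(x, y)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        probs = model.predict_probs(pool)            # shape (N, num_classes)
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= conf_threshold                 # only confident points
        if not keep.any():
            break
        x = np.concatenate([x, pool[keep]])
        y = np.concatenate([y, pseudo[keep]])
        pool = pool[~keep]
        model = train_model(x, y)                     # retrain on enlarged set
    return model

def co_train(train_a, train_b, labeled_x, labeled_y, unlabeled_x,
             rounds=5, conf_threshold=0.9):
    """Algorithm 2 (Standard Co-Training), one common form: two models
    trained with different feature priors take turns pseudo-labeling
    confident unlabeled points, which grow a shared labeled pool."""
    x, y, pool = labeled_x, labeled_y, unlabeled_x
    model_a, model_b = train_a(x, y), train_b(x, y)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        taken = np.zeros(len(pool), dtype=bool)
        for model in (model_a, model_b):
            probs = model.predict_probs(pool)
            conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
            keep = (conf >= conf_threshold) & ~taken
            x = np.concatenate([x, pool[keep]])
            y = np.concatenate([y, pseudo[keep]])
            taken |= keep
        pool = pool[~taken]
        model_a, model_b = train_a(x, y), train_b(x, y)
    return model_a, model_b
```

The key point the paper exploits is that the two co-trained models embody different feature priors, so the pseudo-labels each contributes cover failure modes the other misses.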
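For the dataset splits row, a minimal sketch of the described partition (100 labeled examples per class, 10% of the training set held out for validation, the remainder treated as unlabeled) is given below. The seed and index bookkeeping are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of the labeled / validation / unlabeled index split
# described above. The RNG seed and helper name are illustrative.

import numpy as np

def split_indices(labels, per_class=100, val_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    order = rng.permutation(len(labels))

    n_val = int(val_fraction * len(labels))
    val_idx = order[:n_val]                        # 10% for hyperparameter tuning
    rest = order[n_val:]

    labeled_chunks = []
    for c in np.unique(labels):
        class_idx = rest[labels[rest] == c]
        labeled_chunks.append(class_idx[:per_class])  # 100 labeled per class
    labeled_idx = np.concatenate(labeled_chunks)

    unlabeled_idx = np.setdiff1d(rest, labeled_idx)   # the rest is unlabeled
    return labeled_idx, val_idx, unlabeled_idx
```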
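Finally, the experiment-setup row translates naturally into a PyTorch configuration. The optimizer, momentum, weight decay, augmentations, batch size, and epoch count below follow the quoted text; the initial learning rate and the exact decay schedule are specified in the paper's Table 11 (not reproduced here), so those values are placeholders.

```python
# Hedged PyTorch sketch of the quoted training configuration.
# lr and the StepLR parameters are placeholders; the paper's Table 11
# gives the actual learning-rate schedule.

import torch
from torch import optim
from torchvision import transforms

augment_32px = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crop with 4px padding
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.RandomRotation(2),           # random rotation of 2 degrees
    transforms.ToTensor(),
])

def make_optimizer(model, lr=0.1):          # lr is a placeholder (see Table 11)
    opt = optim.SGD(model.parameters(), lr=lr,
                    momentum=0.9,           # momentum coefficient 0.9
                    weight_decay=1e-4)      # weight decay coefficient 10^-4
    # Decay schedule is assumed; the paper only states a decaying
    # learning rate, with the chosen parameters in Table 11.
    sched = optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.1)
    return opt, sched

BATCH_SIZE_32PX = 128   # 128 for 32x32-sized images (64 for 96x96-sized images)
EPOCHS = 300            # total training epochs
```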