Combining Diverse Feature Priors

Authors: Saachi Jain, Dimitris Tsipras, Aleksander Madry

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of feature priors have less overlapping failure modes, and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other's mistakes, which, in turn, leads to better generalization and resilience to spurious correlations.
Researcher Affiliation | Academia | Saachi Jain*1, Dimitris Tsipras*1, Aleksander Madry1. 1MIT. Correspondence to: Saachi Jain <saachij@mit.edu>, Dimitris Tsipras <tsipras@mit.edu>.
Pseudocode | Yes | Algorithm 1 Self-Training... Algorithm 2 Standard Co-Training (a generic sketch of both loops follows the table).
Open Source Code | Yes | Code available at https://github.com/MadryLab/copriors.
Open Datasets | Yes | We train models on a small subset (100 examples per class) of the CIFAR-10 (Krizhevsky, 2009) and STL-10 (Coates et al., 2011) datasets... We also create two datasets that each contain a different spurious correlation: Tinted STL-10... Biased CelebA (Liu et al., 2015).
Dataset Splits | Yes | Specifically, we treat a small fraction of the training set as labeled examples (100 examples per class), another fraction as our validation set for tuning hyperparameters (10% of the total training examples), and the rest as unlabeled data. (A sketch of such a split follows the table.)
Hardware Specification | Yes | All our experiments are performed using our internal cluster which mainly consists of NVIDIA 1080 Ti GTX GPUs.
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., a Python version or a specific deep learning framework such as PyTorch or TensorFlow with versions).
Experiment Setup | Yes | We train all our models using stochastic gradient descent (SGD) with momentum (a coefficient of 0.9) and a decaying learning rate. We add weight decay regularization with a coefficient of 10^-4. In terms of data augmentation, we apply random cropping with a padding of 4 pixels, random horizontal flips, and a random rotation of 2 degrees. ... We train all models with a batch size of 64 for 96×96-sized images and 128 for 32×32-sized images for a total of 300 epochs. ... The parameters chosen are shown in Table 11. (A training-setup sketch follows the table.)
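
The Pseudocode row cites Algorithm 1 (Self-Training) and Algorithm 2 (Standard Co-Training). The sketch below is a minimal, generic rendering of those two loops rather than the paper's exact procedure: the scikit-learn-style fit/predict_proba interface, the 0.9 confidence threshold, the fixed number of rounds, and the toy data are all illustrative assumptions. In the paper's setting, the two co-trained models would instead be networks trained with different feature priors, trading pseudo-labels on the unlabeled pool.

```python
# Minimal sketch of self-training and standard co-training via pseudo-labels.
# The estimator interface (fit / predict_proba), the confidence threshold,
# and the number of rounds are assumptions, not the paper's settings.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def self_train(model, X_lab, y_lab, X_unlab, rounds=5, threshold=0.9):
    """Repeatedly pseudo-label the most confident unlabeled points and retrain."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        model = clone(model).fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= threshold
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        X_unlab = X_unlab[~keep]
    return model


def co_train(model_a, model_b, X_lab, y_lab, X_unlab, rounds=5, threshold=0.9):
    """Two models (e.g., trained with diverse feature priors) trade confident
    pseudo-labels on a shared unlabeled pool each round."""
    data = {"a": (X_lab.copy(), y_lab.copy()), "b": (X_lab.copy(), y_lab.copy())}
    models = {"a": model_a, "b": model_b}
    X_pool = X_unlab.copy()
    for _ in range(rounds):
        for k in models:
            models[k] = clone(models[k]).fit(*data[k])
        if len(X_pool) == 0:
            break
        taken = np.zeros(len(X_pool), dtype=bool)
        for src, dst in (("a", "b"), ("b", "a")):
            probs = models[src].predict_proba(X_pool)
            conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
            keep = conf >= threshold
            X_dst, y_dst = data[dst]
            data[dst] = (np.vstack([X_dst, X_pool[keep]]),
                         np.concatenate([y_dst, pseudo[keep]]))
            taken |= keep
        X_pool = X_pool[~taken]   # points pseudo-labeled this round leave the pool
    return models["a"], models["b"]


# Toy usage: 100 labeled examples plus a larger unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, X_unlab, y_lab, _ = train_test_split(X, y, train_size=100, random_state=0)
model = self_train(LogisticRegression(max_iter=1000), X_lab, y_lab, X_unlab)
m_a, m_b = co_train(LogisticRegression(max_iter=1000),
                    LogisticRegression(max_iter=1000),
                    X_lab, y_lab, X_unlab)
```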
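
The Dataset Splits row describes carving the original training set into 100 labeled examples per class, a validation set of 10% of the training examples, and an unlabeled remainder. Below is a hedged sketch of such a split for CIFAR-10 with torchvision; the seed, the order in which the splits are taken, and the Subset wrappers are assumptions for illustration.

```python
# Sketch: split CIFAR-10's training set into labeled (100 per class),
# validation (10% of the training examples), and unlabeled portions.
# Seed and split order are illustrative, not the paper's exact protocol.
import numpy as np
from torchvision import datasets
from torch.utils.data import Subset

train_set = datasets.CIFAR10(root="data", train=True, download=True)
targets = np.array(train_set.targets)

rng = np.random.default_rng(0)
perm = rng.permutation(len(targets))

n_val = int(0.10 * len(targets))                 # 10% held out for validation
val_idx = perm[:n_val]
remaining = perm[n_val:]

labeled_idx = []                                 # 100 labeled examples per class
for c in range(10):
    cls_idx = remaining[targets[remaining] == c]
    labeled_idx.extend(cls_idx[:100].tolist())
labeled_idx = np.array(labeled_idx)

unlabeled_idx = np.setdiff1d(remaining, labeled_idx)  # everything else is unlabeled

labeled = Subset(train_set, labeled_idx.tolist())
val = Subset(train_set, val_idx.tolist())
unlabeled = Subset(train_set, unlabeled_idx.tolist())
```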
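
The Experiment Setup row gives the optimizer (SGD with momentum 0.9 and a decaying learning rate), weight decay of 10^-4, the augmentations, the batch sizes (64 for 96×96 inputs, 128 for 32×32), and 300 epochs. The PyTorch sketch below wires these together for the 32×32 case; the ResNet-18 backbone, the base learning rate, and the step decay schedule are placeholders, since the paper defers its chosen values to Table 11.

```python
# Sketch of the reported training configuration (32x32 inputs, e.g. CIFAR-10).
# The backbone, base learning rate, and decay schedule are placeholders; the
# paper's chosen hyperparameters are listed in its Table 11.
import torch
from torch import nn, optim
from torchvision import datasets, transforms, models

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # random crop, 4-pixel padding
    transforms.RandomHorizontalFlip(),           # random horizontal flip
    transforms.RandomRotation(2),                # random rotation of 2 degrees
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=train_transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1,          # lr is a placeholder
                      momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(300):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```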