Combining Ensembles and Data Augmentation Can Harm Your Calibration

Authors: Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W Dusenberry, Jasper Snoek, Balaji Lakshminarayanan, Dustin Tran

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby improved accuracy by combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces a new state of the art in uncertainty calibration across CIFAR-10, CIFAR-100, and ImageNet. (A minimal sketch of ECE, the calibration metric reported, appears after the table.)
Researcher Affiliation | Collaboration | Yeming Wen 1, Ghassen Jerfel 2, Rafael Muller 2, Michael W. Dusenberry 2, Jasper Snoek 2, Balaji Lakshminarayanan 2 & Dustin Tran 2 (equal contribution); 1 University of Texas at Austin, 2 Google Brain
Pseudocode | Yes | Algorithm 1: Forgetting Count Based CAMixup (a hedged sketch of the class-conditional Mixup idea appears after the table)
Open Source Code | Yes | Code: https://github.com/google/edward2/tree/master/experimental/marginalization_mixup
Open Datasets | Yes | CIFAR & CIFAR-C: We consider two CIFAR datasets, CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Each consists of a training set of size 50K and a test set of size 10K. They are natural images with 32x32 pixels. Each class has 5,000 training images on CIFAR-10 and 500 on CIFAR-100. [...] ImageNet & ImageNet-C: We used the ILSVRC 2012 classification dataset (Deng et al., 2009), which consists of a total of 1.2 million training images, 50,000 validation images, and 150,000 testing images. Images span 1,000 classes.
Dataset Splits | Yes | If a training method requires a validation dataset, such as CAMixup, we use a separate set of 2,500 images from the 50K training images as the validation set. (A sketch of such a split appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) are mentioned for the experiments.
Software Dependencies | No | No specific software dependencies with version numbers are listed.
Experiment Setup | Yes | We kept the same set of hyperparameters as the BatchEnsemble model in Wen et al. (2020). All hyperparameters can be found in Table 3. The most sensitive hyperparameters we found are whether to use ensemble batch norm, which applies a separate batch norm layer to each ensemble member, and the value of random_sign_init, which controls the standard deviation of the Gaussian-distributed initialization of s and r. We kept BatchEnsemble on CIFAR-10 the same as Wen et al. (2020), which does not use ensemble batch norm. We enable ensemble batch norm on CIFAR-100 and ImageNet, which allows us to use a larger standard deviation in the initialization. The random_sign_init is -0.5 on CIFAR-10 and -0.75 on CIFAR-100 and ImageNet. In the code, a negative value denotes the standard deviation of a Gaussian distribution (a positive value instead initializes with a Bernoulli sign distribution under that probability); we use only negative random_sign_init, i.e., only Gaussian-distributed initialization, in this work. Table 3: Hyperparameters used in Section 3 for BatchEnsemble. The difference between CIFAR-10 and CIFAR-100 is l2, random_sign_init, and whether to use Sync Ensemble_BN. (A small sketch of the random_sign_init sign convention appears after the table.)
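The calibration gains quoted in the abstract are reported in the paper mainly as expected calibration error (ECE). As a point of reference, here is a minimal sketch of equal-width-bin ECE; the bin count and function name are illustrative choices, not the paper's evaluation code.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by the fraction of examples in each bin."""
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece
```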
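Algorithm 1 (Forgetting Count Based CAMixup) applies Mixup selectively rather than to every example. The sketch below illustrates only the general class-conditional idea described in the paper, i.e. mixing examples whose class the model is over-confident on; the per-class boolean mask, function name, and NumPy formulation are assumptions, and the forgetting-count bookkeeping that gives Algorithm 1 its name is not reproduced here.

```python
import numpy as np

def class_conditional_mixup(images, one_hot_labels, mix_class, alpha=0.2, rng=None):
    """Mix only examples whose class is flagged in `mix_class`
    (e.g., classes whose held-out confidence exceeds their accuracy)."""
    rng = rng or np.random.default_rng()
    images = images.astype(np.float32)
    one_hot_labels = one_hot_labels.astype(np.float32)

    lam = rng.beta(alpha, alpha)            # shared Mixup coefficient for the batch
    perm = rng.permutation(len(images))     # mixing partners

    mixed = mix_class[one_hot_labels.argmax(axis=1)]   # (batch,) bool per example
    lam_i = np.where(mixed, lam, 1.0)                  # lam = 1 means "no mixing"

    lam_x = lam_i.reshape(-1, 1, 1, 1)
    lam_y = lam_i.reshape(-1, 1)
    return (lam_x * images + (1 - lam_x) * images[perm],
            lam_y * one_hot_labels + (1 - lam_y) * one_hot_labels[perm])
```

In the paper's confidence-adjusted variant, the mask would be recomputed each epoch from held-out per-class accuracy and confidence; Algorithm 1, per its title, drives that decision with forgetting counts instead, which this sketch does not attempt to model.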
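The 2,500-image validation split mentioned under Dataset Splits is not specified further; a minimal sketch of one way to carve it out of the 50K CIFAR training images is shown below (the seed and index-based approach are illustrative assumptions).

```python
import numpy as np

def train_val_split(num_train=50_000, num_val=2_500, seed=0):
    """Hold out num_val of the num_train training images as a validation set."""
    perm = np.random.default_rng(seed).permutation(num_train)
    return perm[num_val:], perm[:num_val]   # (train indices, validation indices)

train_idx, val_idx = train_val_split()      # 47,500 train / 2,500 validation indices
```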
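Finally, the random_sign_init convention described in the Experiment Setup row can be made concrete. The helper below is only an illustration of that sign convention for BatchEnsemble's fast weights r and s, not the edward2 initializer itself; in particular, centering the Gaussian at 1.0 is an assumption.

```python
import numpy as np

def init_fast_weights(shape, random_sign_init, rng=None):
    """random_sign_init < 0: Gaussian init with std |random_sign_init| (assumed mean 1.0).
    random_sign_init > 0: random +/-1 signs, +1 with probability random_sign_init."""
    rng = rng or np.random.default_rng()
    if random_sign_init < 0:
        return rng.normal(loc=1.0, scale=-random_sign_init, size=shape)
    return np.where(rng.random(shape) < random_sign_init, 1.0, -1.0)

# Example: the Gaussian setting quoted above for CIFAR-100 / ImageNet.
s = init_fast_weights((4, 512), random_sign_init=-0.75)   # 4 ensemble members, 512 units
```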