When and How Mixup Improves Calibration

Authors: Linjun Zhang, Zhun Deng, Kenji Kawaguchi, James Zou

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we theoretically prove that Mixup improves calibration in high-dimensional settings by investigating natural statistical models. Interestingly, the calibration benefit of Mixup increases as the model capacity increases. We support our theories with experiments on common architectures and datasets.
Researcher Affiliation | Academia | 1 Rutgers University, 2 Harvard University, 3 National University of Singapore, 4 Stanford University.
Pseudocode | Yes | Algorithm 1: The pseudo-labeling algorithm
Open Source Code | No | The paper does not provide a specific link or explicit statement about the availability of open-source code for the described methodology.
Open Datasets | Yes | We used the standard data sets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009)... We adopted the standard data sets, Kuzushiji-MNIST (Clanuwat et al., 2019), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009).
Dataset Splits | No | The paper mentions data augmentation but does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility, relying on
Hardware Specification | Yes | We run experiments with a machine with 10-Core 3.30 GHz Intel Core i9-9820X and four NVIDIA RTX 2080 Ti GPUs with 11 GB GPU memory.
Software Dependencies | No | The paper mentions software like "torchvision.transforms" and "SGD" but does not provide specific version numbers for any software dependencies or frameworks.
Experiment Setup | Yes | For the experiments on the effect of the width, we fixed the depth to be 8 and varied the width from 10 to 3000. For the experiments on the effect of the depth, the depth was varied from 1 to 24 (i.e., from 3 to 26 layers including input/output layers) by fixing the width to be 400 with data-augmentation and 80 without data-augmentation. We used stochastic gradient descent (SGD) with mini-batch size of 64. We set the learning rate to be 0.01 and momentum coefficient to be 0.9. We used the Beta distribution Beta(α, α) with α = 1.0 for Mixup.
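The Experiment Setup row pins down the training recipe quoted from the paper: SGD with mini-batch size 64, learning rate 0.01, momentum 0.9, and Mixup weights drawn from Beta(1.0, 1.0). Since no open-source code is linked, the following is only a minimal PyTorch sketch of a Mixup training loop under those settings; the model, data-loader, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Mixup training with the hyperparameters reported above.
# Everything not stated in the paper excerpt (model class, device handling,
# function names) is an illustrative assumption.
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=1.0):
    """Return mixed inputs, the two label sets being mixed, and the weight lambda."""
    lam = np.random.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha), alpha = 1.0
    perm = torch.randperm(x.size(0), device=x.device)  # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]          # convex combination of inputs
    return x_mixed, y, y[perm], lam

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for x, y in loader:                                # mini-batches of size 64
        x, y = x.to(device), y.to(device)
        x_mixed, y_a, y_b, lam = mixup_batch(x, y, alpha=1.0)
        logits = model(x_mixed)
        # The loss applies the same convex combination to the two label sets.
        loss = lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Optimizer as described in the table: SGD with lr = 0.01 and momentum = 0.9.
# model = SomeNetwork(width=400, depth=8)   # hypothetical architecture matching the sweep
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```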
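The paper's central claim concerns calibration, but the excerpt above does not reproduce the evaluation code. The sketch below assumes the standard expected calibration error (ECE) with equal-width confidence bins, which is the metric commonly used for such claims; the bin count and function name are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of expected calibration error (ECE); details are assumptions.
import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15):
    probs = F.softmax(logits, dim=1)
    confidences, predictions = probs.max(dim=1)       # top-class confidence and prediction
    accuracies = predictions.eq(labels).float()
    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| in the bin, weighted by the fraction of samples it holds
            gap = (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece.item()
```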