When and How Mixup Improves Calibration
Authors: Linjun Zhang, Zhun Deng, Kenji Kawaguchi, James Zou
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we theoretically prove that Mixup improves calibration in high-dimensional settings by investigating natural statistical models. Interestingly, the calibration benefit of Mixup increases as the model capacity increases. We support our theories with experiments on common architectures and datasets. |
| Researcher Affiliation | Academia | 1Rutgers University 2Harvard University 3National University of Singapore 4Stanford University. |
| Pseudocode | Yes | Algorithm 1 The pseudo-labeling algorithm |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We used the standard data sets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009)... We adopted the standard data sets, Kuzushiji-MNIST (Clanuwat et al., 2019), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions data augmentation but does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | Yes | We run experiments with a machine with 10-Core 3.30 GHz Intel Core i9-9820X and four NVIDIA RTX 2080 Ti GPUs with 11 GB GPU memory. |
| Software Dependencies | No | The paper mentions software like "torchvision.transforms" and "SGD" but does not provide specific version numbers for any software dependencies or frameworks. |
| Experiment Setup | Yes | For the experiments on the effect of the width, we fixed the depth to be 8 and varied the width from 10 to 3000. For the experiments on the effect of the depth, the depth was varied from 1 to 24 (i.e., from 3 to 26 layers including input/output layers) by fixing the width to be 400 with data-augmentation and 80 without data-augmentation. We used stochastic gradient descent (SGD) with mini-batch size of 64. We set the learning rate to be 0.01 and momentum coefficient to be 0.9. We used the Beta distribution Beta(α, α) with α = 1.0 for Mixup. |
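
The experiment setup row above reports concrete hyperparameters (SGD with mini-batch size 64, learning rate 0.01, momentum 0.9, and Mixup with Beta(α, α), α = 1.0). Below is a minimal sketch of a single Mixup training step using those reported values. The model definition (a small fully connected network) and the use of CIFAR-10 via torchvision are illustrative assumptions, not the paper's exact architecture or pipeline.

```python
# Minimal Mixup training sketch using the hyperparameters reported above.
# Assumptions (not from the paper): the 2-layer MLP and CIFAR-10 loading are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

ALPHA = 1.0  # Beta(alpha, alpha) for Mixup, as reported


def mixup_batch(x, y, num_classes, alpha=ALPHA):
    """Mix a batch with a randomly permuted copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix


# Hypothetical stand-in model; the paper varies width and depth of fully connected nets.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 400), nn.ReLU(),
    nn.Linear(400, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # as reported
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)  # mini-batch size 64

model.train()
for x, y in loader:
    x_mix, y_mix = mixup_batch(x, y, num_classes=10)
    logits = model(x_mix)
    # Cross-entropy against the mixed (soft) labels.
    loss = -(y_mix * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    break  # single illustrative step
```

Mixing the one-hot labels with the same λ as the inputs is the standard Mixup formulation; equivalently, one can compute `lam * ce(logits, y) + (1 - lam) * ce(logits, y[perm])`, which avoids materializing one-hot targets.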