Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
When and How Mixup Improves Calibration
Authors: Linjun Zhang, Zhun Deng, Kenji Kawaguchi, James Zou
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we theoretically prove that Mixup improves calibration in high-dimensional settings by investigating natural statistical models. Interestingly, the calibration benefit of Mixup increases as the model capacity increases. We support our theories with experiments on common architectures and datasets. |
| Researcher Affiliation | Academia | ¹Rutgers University, ²Harvard University, ³National University of Singapore, ⁴Stanford University. |
| Pseudocode | Yes | Algorithm 1 The pseudo-labeling algorithm |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We used the standard data sets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009)... We adopted the standard data sets, Kuzushiji-MNIST (Clanuwat et al., 2019), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions data augmentation but does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility, relying on |
| Hardware Specification | Yes | We run experiments with a machine with 10-Core 3.30 GHz Intel Core i9-9820X and four NVIDIA RTX 2080 Ti GPUs with 11 GB GPU memory. |
| Software Dependencies | No | The paper mentions software like "torchvision.transforms" and "SGD" but does not provide specific version numbers for any software dependencies or frameworks. |
| Experiment Setup | Yes | For the experiments on the effect of the width, we fixed the depth to be 8 and varied the width from 10 to 3000. For the experiments on the effect of the depth, the depth was varied from 1 to 24 (i.e., from 3 to 26 layers including input/output layers) by fixing the width to be 400 with data-augmentation and 80 without data-augmentation. We used stochastic gradient descent (SGD) with mini-batch size of 64. We set the learning rate to be 0.01 and momentum coefficient to be 0.9. We used the Beta distribution Beta(α, α) with α = 1.0 for Mixup. |
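
The Experiment Setup row gives the optimizer settings and the Mixup distribution, so the mixing step itself can be illustrated. The following is a minimal sketch, not the authors' implementation: the names `CONFIG` and `mixup_batch` are hypothetical, and it assumes one mixing coefficient per mini-batch with partners chosen by shuffling the batch, a detail the table does not specify. Only the Python standard library is used so the example stays self-contained.

```python
import random

# Hyperparameters quoted in the Experiment Setup row above.
CONFIG = {
    "batch_size": 64,
    "learning_rate": 0.01,
    "momentum": 0.9,
    "mixup_alpha": 1.0,
}


def mixup_batch(inputs, one_hot_labels, alpha=CONFIG["mixup_alpha"], rng=random):
    """Blend each example with a randomly chosen partner from the same batch.

    Assumption (not stated in the table): a single lambda is drawn per batch
    and partners come from a random permutation of the batch indices.
    """
    lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    partner = list(range(len(inputs)))
    rng.shuffle(partner)

    def blend(a, b):
        # Convex combination of two equal-length vectors.
        return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

    mixed_inputs = [blend(inputs[i], inputs[j]) for i, j in enumerate(partner)]
    mixed_labels = [
        blend(one_hot_labels[i], one_hot_labels[j]) for i, j in enumerate(partner)
    ]
    return mixed_inputs, mixed_labels, lam
```

With α = 1.0, Beta(α, α) is the uniform distribution on [0, 1], so every mixing ratio is equally likely. Because the labels are blended with the same coefficient as the inputs, each mixed label remains a valid probability vector.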