Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond One-Hot Labels: Semantic Mixing for Model Calibration

Authors: Haoyang Luo, Linwei Tao, Minjing Dong, Chang Xu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Our code is available here. We conduct experiments with DNNs of different architectures, including ResNet-50/101 (He et al., 2016), Wide-ResNet-26-10 (Zagoruyko, 2016), and DenseNet-121 (Huang et al., 2017). We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons.
Researcher Affiliation | Academia | 1 Department of Computer Science, City University of Hong Kong; 2 School of Computer Science, University of Sydney. Correspondence to: Minjing Dong <EMAIL>.
Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available here.
Open Datasets | Yes | We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons.
Dataset Splits | Yes | CIFAR-10: By default, the dataset is split as 50,000, 5,000, and 5,000 samples for training, validation, and testing. CIFAR-100: The split is similar to CIFAR-10, with 50,000 for training, 5,000 for validation, and 5,000 for testing.
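The quoted split (50,000 + 5,000 + 5,000 = 60,000 images) matches CIFAR-10's total size, which suggests the standard 10,000-image test set is halved into validation and test subsets. A minimal sketch under that assumption; the helper name and fixed seed are illustrative, not taken from the authors' code:

```python
import random

def split_val_test(n_test=10000, n_val=5000, seed=0):
    """Shuffle the standard test-set indices and split them into
    validation and held-out test index lists (illustrative only)."""
    indices = list(range(n_test))
    random.Random(seed).shuffle(indices)
    return indices[:n_val], indices[n_val:]

# Yields 5,000 validation and 5,000 test indices with no overlap.
val_idx, test_idx = split_val_test()
```

The same recipe would apply to CIFAR-100, whose test set is also 10,000 images.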
Hardware Specification | Yes | We run CSM with a single RTX A4000 device. Specifically, it takes less than 4h for a single A4000 GPU to generate all CIFAR-10 or CIFAR-100 augmented samples, while using less than 8h for the same computing units to generate for Tiny-ImageNet.
Software Dependencies | No | The paper describes the use of existing code and checkpoints (Karras et al., 2022a; Wang et al., 2023b) for generating augmented samples, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) used for the experimental setup.
Experiment Setup | Yes | We conduct 200-epoch training on CIFAR-10 and CIFAR-100 and 100-epoch training on Tiny-ImageNet using the SGD optimizer with momentum set to 0.9. We adopt a multi-step learning rate schedule which decreases from 0.1 to 0.01 and 0.001 at epochs 81 and 121 for CIFAR-10/100, or epochs 40 and 60 for Tiny-ImageNet, respectively. The weight decay is set to 5e-4. We select the scaling hyperparameter s as 4.0 for CIFAR-10/Tiny-ImageNet and 2.3 for CIFAR-100 using their validation sets.
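The multi-step schedule described above can be sketched as a small helper; the function name is illustrative, and the milestones and rates are the ones quoted in the setup (epochs 81/121 for CIFAR-10/100, 40/60 for Tiny-ImageNet):

```python
def lr_at_epoch(epoch, milestones=(81, 121), base_lr=0.1, gamma=0.1):
    """Return the learning rate in effect at a given epoch under a
    multi-step schedule: the rate is multiplied by gamma at each
    milestone (sketch of the schedule described in the paper)."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

# CIFAR-10/100: 0.1 until epoch 81, 0.01 until epoch 121, then 0.001.
# Tiny-ImageNet would instead use milestones=(40, 60) over 100 epochs.
```

In a PyTorch setup this would correspond to `torch.optim.SGD` with momentum 0.9 and weight decay 5e-4 together with `MultiStepLR`, though the authors' exact training code is not reproduced here.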