Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond One-Hot Labels: Semantic Mixing for Model Calibration
Authors: Haoyang Luo, Linwei Tao, Minjing Dong, Chang Xu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Our code is available here. We conduct experiments with DNNs of different architectures, including ResNet-50/101 (He et al., 2016), Wide-ResNet-26-10 (Zagoruyko, 2016), and DenseNet-121 (Huang et al., 2017). We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons. |
| Researcher Affiliation | Academia | 1Department of Computer Science, City University of Hong Kong 2School of Computer Science, University of Sydney. Correspondence to: Minjing Dong <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available here. |
| Open Datasets | Yes | We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons. |
| Dataset Splits | Yes | CIFAR-10: By default, the dataset is split as 50,000, 5,000, and 5,000 samples for training, validation, and testing. CIFAR-100: The split is similar to CIFAR-10 with 50,000 for training, 5,000 for validation, and 5,000 for testing. |
| Hardware Specification | Yes | We run CSM with a single RTX A4000 device. Specifically, it takes less than 4h for a single A4000 GPU to generate the amount of all CIFAR-10 or CIFAR-100 augmented samples, while using less than 8h for the same computing units to generate for Tiny-ImageNet. |
| Software Dependencies | No | The paper describes the use of existing code and checkpoints (Karras et al., 2022a; Wang et al., 2023b) for generating augmented samples, but it does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) used for their experimental setup. |
| Experiment Setup | Yes | We conduct 200-epoch training on CIFAR-10 and CIFAR-100 and 100-epoch training on Tiny-ImageNet using the SGD optimizer with momentum set to 0.9. We adopt a multi-step learning rate schedule which decreases from 0.1 to 0.01 and 0.001 at epochs 81 and 121 for CIFAR-10/100, or epochs 40 and 60 for Tiny-ImageNet, respectively. The weight decay is set to 5e-4. We select the scaling hyperparameter s as 4.0 for CIFAR-10/Tiny-ImageNet and 2.3 for CIFAR-100 using their validation sets. |
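The multi-step learning-rate schedule quoted in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name `lr_at_epoch` and the decay factor of 0.1 (implied by 0.1 → 0.01 → 0.001) are assumptions; the milestones match the CIFAR-10/100 settings quoted above.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(81, 121), gamma=0.1):
    """Return the learning rate at a given epoch under a multi-step
    schedule: multiply base_lr by gamma once for each milestone
    epoch that has already been reached.

    Defaults reflect the reported CIFAR-10/100 setup; for
    Tiny-ImageNet the quoted milestones would be (40, 60).
    """
    steps_passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** steps_passed)
```

For example, `lr_at_epoch(80)` yields 0.1, while `lr_at_epoch(81)` and `lr_at_epoch(121)` yield (up to floating-point rounding) 0.01 and 0.001, matching the schedule described in the paper.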