Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond One-Hot Labels: Semantic Mixing for Model Calibration

Authors: Haoyang Luo, Linwei Tao, Minjing Dong, Chang Xu

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Our code is available here. We conduct experiments with DNNs of different architectures, including ResNet-50/101 (He et al., 2016), Wide-ResNet-26-10 (Zagoruyko, 2016), and DenseNet-121 (Huang et al., 2017). We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons.
Researcher Affiliation | Academia | 1 Department of Computer Science, City University of Hong Kong; 2 School of Computer Science, University of Sydney. Correspondence to: Minjing Dong <EMAIL>.
Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available here.
Open Datasets | Yes | We adopt CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Tiny-ImageNet (Le & Yang, 2015) for calibration performance and out-of-distribution (OOD) robustness comparisons.
Dataset Splits | Yes | CIFAR-10: By default, the dataset is split as 50,000, 5,000, and 5,000 samples for training, validation, and testing. CIFAR-100: The split is similar to CIFAR-10, with 50,000 for training, 5,000 for validation, and 5,000 for testing.
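The quoted split (50,000 + 5,000 + 5,000 = 60,000 images) matches CIFAR-10's total size, which suggests the standard 10,000-image test set is halved into validation and test subsets. A minimal sketch under that assumption; the helper name and fixed seed are illustrative, not taken from the authors' code:

```python
import random

def split_val_test(n_test=10000, n_val=5000, seed=0):
    """Shuffle the standard test-set indices and split them into
    validation and held-out test index lists (illustrative only)."""
    indices = list(range(n_test))
    random.Random(seed).shuffle(indices)
    return indices[:n_val], indices[n_val:]

# Yields 5,000 validation and 5,000 test indices with no overlap.
val_idx, test_idx = split_val_test()
```

The same recipe would apply to CIFAR-100, whose test set is also 10,000 images.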
Hardware Specification | Yes | We run CSM with a single RTX A4000 device. Specifically, it takes less than 4h for a single A4000 GPU to generate all CIFAR-10 or CIFAR-100 augmented samples, while using less than 8h for the same computing units to generate for Tiny-ImageNet.
Software Dependencies | No | The paper describes the use of existing code and checkpoints (Karras et al., 2022a; Wang et al., 2023b) for generating augmented samples, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) used for the experimental setup.
Experiment Setup | Yes | We conduct 200-epoch training on CIFAR-10 and CIFAR-100 and 100-epoch training on Tiny-ImageNet using the SGD optimizer with momentum set to 0.9. We adopt a multi-step learning rate schedule which decreases from 0.1 to 0.01 and 0.001 at epochs 81 and 121 for CIFAR-10/100, or epochs 40 and 60 for Tiny-ImageNet, respectively. The weight decay is set to 5e-4. We select the scaling hyperparameter s as 4.0 for CIFAR-10/Tiny-ImageNet and 2.3 for CIFAR-100 using their validation sets.
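The multi-step schedule described above can be sketched as a small helper; the function name is illustrative, and the milestones and rates are the ones quoted in the setup (epochs 81/121 for CIFAR-10/100, 40/60 for Tiny-ImageNet):

```python
def lr_at_epoch(epoch, milestones=(81, 121), base_lr=0.1, gamma=0.1):
    """Return the learning rate in effect at a given epoch under a
    multi-step schedule: the rate is multiplied by gamma at each
    milestone (sketch of the schedule described in the paper)."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

# CIFAR-10/100: 0.1 until epoch 81, 0.01 until epoch 121, then 0.001.
# Tiny-ImageNet would instead use milestones=(40, 60) over 100 epochs.
```

In a PyTorch setup this would correspond to `torch.optim.SGD` with momentum 0.9 and weight decay 5e-4 together with `MultiStepLR`, though the authors' exact training code is not reproduced here.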