Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PAC-Bayes Analysis for Recalibration in Classification

Authors: Masahiro Fujisawa, Futoshi Futami

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical experiments show that our algorithm enhances the performance of Gaussian process-based recalibration across various benchmark datasets and models.
Researcher Affiliation Academia (1) The University of Osaka, Osaka, Japan; (2) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan. Correspondence to: Masahiro Fujisawa <EMAIL>, Futoshi Futami <EMAIL>.
Pseudocode Yes Algorithm 1 PAC-Bayes recalibration (PBR)
Open Source Code No The paper mentions adapting code from Wenger et al. (2020) and obtaining models from a GitHub project for the CIFAR-100 experiments, but it contains no explicit statement that the authors release their own implementation of the method described in the paper. The provided links point to third-party tools and existing models, not to original code.
Open Datasets Yes We primarily report results from multiclass classification experiments on the MNIST (Le Cun et al., 1989) and CIFAR-100 (Krizhevsky, 2009) datasets. Additional experimental results are provided in Appendix E, and details of our experimental settings, including model specifications, are summarized in Appendix D. For ECE evaluation, we set B = n_re^(1/3) based on our findings.
Table 2. Datasets used in our experiments
Dataset | Classes | Train data (n_tr) | Recalibration data (n_re) | Test data (n_te)
KITTI (Geiger, 2012) | 2 | 16000 | 1000 | 8000
PCam (Veeling et al., 2018) | 2 | 22768 | 1000 | 9000
MNIST (Le Cun et al., 1989) | 10 | 60000 | 1000 | 9000
CIFAR-100 (Krizhevsky, 2009) | 100 | 50000 | 1000 | 9000
Dataset Splits Yes Table 2. Datasets used in our experiments
Dataset | Classes | Train data (n_tr) | Recalibration data (n_re) | Test data (n_te)
KITTI (Geiger, 2012) | 2 | 16000 | 1000 | 8000
PCam (Veeling et al., 2018) | 2 | 22768 | 1000 | 9000
MNIST (Le Cun et al., 1989) | 10 | 60000 | 1000 | 9000
CIFAR-100 (Krizhevsky, 2009) | 100 | 50000 | 1000 | 9000
... We conducted 10-fold cross-validation for recalibration function training and reported the mean and standard deviation of these two performance metrics. For ECE evaluation, we set B = n_re^(1/3) based on our findings.
Hardware Specification Yes Our CIFAR-100 experiments were conducted on NVIDIA GPUs with 32GB memory (NVIDIA DGX-1 with Tesla V100 and DGX-2). For the other experiments, we used CPU (Apple M1) with 16GB memory.
Software Dependencies No The paper mentions various software components and frameworks like "XGBoost (Chen and Guestrin, 2016)", "Random Forests (Breiman, 2001)", "Gaussian process (GP) (Rasmussen and Williams, 2005)", "temperature scaling (Guo et al., 2017)", and "GP calibration (Wenger et al., 2020)". However, it does not specify explicit version numbers for these software packages or libraries (e.g., PyTorch version, Python version, specific library versions).
Experiment Setup Yes PBR has an optimizable parameter 1/λ in addition to the posterior parameter; thus, we selected it using grid search from the following candidates: {0., 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0}. ... We also used J = 100 Monte Carlo samples from posterior ρ to obtain V. ... For ECE evaluation, we set B = n_re^(1/3) based on our findings.
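The bin count B = n_re^(1/3) reported in the rows above can be illustrated with a minimal ECE computation. This is a generic sketch of the standard equal-width-bin expected calibration error, not the authors' implementation; the function name and inputs are hypothetical, and only the choice B = n_re^(1/3) is taken from the paper's stated setup.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_re):
    """Equal-width-bin ECE with B = n_re^(1/3) bins.

    confidences: predicted confidence of each test point (array of floats in [0, 1])
    correct:     1 if the prediction was correct, else 0 (array of ints)
    n_re:        size of the recalibration set, used only to pick the bin count
    """
    # Bin count follows the paper's reported choice B = n_re^(1/3).
    B = max(1, round(n_re ** (1 / 3)))
    edges = np.linspace(0.0, 1.0, B + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are (lo, hi]; the first bin also includes its left edge.
        if lo > 0:
            mask = (confidences > lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = correct[mask].mean()   # empirical accuracy in the bin
        conf = confidences[mask].mean()  # mean confidence in the bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

With n_re = 1000 this yields B = 10 bins of width 0.1, matching the recalibration-set size used for every dataset in Table 2.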