Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Where are we with calibration under dataset shift in image classification?
Authors: Mélanie Roschewitz, Raghav Mehta, Fabio De Sousa Ribeiro, Ben Glocker
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. |
| Researcher Affiliation | Academia | Mélanie Roschewitz EMAIL Imperial College London Raghav Mehta EMAIL Imperial College London Fabio De Sousa Ribeiro EMAIL Imperial College London Ben Glocker EMAIL Imperial College London |
| Pseudocode | Yes | A.5.1 EBS: Energy-based calibration, implementation details In this section, we detail the algorithm proposed by Kim & Kwon (2024) for their energy-based calibration method. Algorithm 1: Energy-based calibration (Kim & Kwon, 2024) ... A.5.2 Enhancing other calibrators with OOD exposure ... Algorithm 2: Temperature scaling with semantic OOD |
| Open Source Code | Yes | 1Our code is publicly available at https://github.com/biomedia-mira/calibration_under_shifts |
| Open Datasets | Yes | We analyse calibration robustness across different image classification tasks and real-world natural distribution shifts. First, we investigate robustness against geographic and sub-population shifts, in natural image classification using the Living17 and Entity30 datasets (Santurkar et al., 2020), as well as WILDS-iCam (Koh et al., 2021). Next, we analyse realistic shifts specific to medical imaging... using the EMBED (Jeong et al., 2023) mammography dataset, (ii) scanner, population and prevalence changes in chest X-ray classification (No Finding / Diseased) using CheXpert (Irvin et al., 2019) and MIMIC-CXR (Johnson et al., 2019) (CXR), (iii) equipment, prevalence, and geographic location changes for diabetic retinopathy assessment models, combining multiple public fundus imaging datasets (Karthik & Sohier, 2019; Decencière et al., 2014; Dugas et al., 2015) (RETINA), and (iv) staining protocol changes in histopathology, using WILDS-CameLyon (Koh et al., 2021). Finally, we test against hard modality shifts in natural image classification using DomainNet (Peng et al., 2019) with Real images as the ID domain. |
| Dataset Splits | Yes | We summarise splits and ID/shifted domains definitions in Table A.1. Table A.1: Datasets and shifts used in this study, details on ID and shifted splits definition. Dataset ... (Ntrain, Nval, Ntest) Living17 ... (37570, 6630, 1700) Entity30 ... (131123, 23140, 6000) CXR ... (129732, 23086, 38192) RETINA ... (35126, 10715, 42861) EMBED ... (169758, 43055, 52975) CameLyon ... (272192, 30244, 33560) iCam ... (129809, 7314, 8154) DomainNet ... (217630, 24182, 104082) |
| Hardware Specification | No | The paper does not explicitly mention specific GPU models (e.g., NVIDIA A100), CPU models (e.g., Intel Xeon), or cloud computing resources with their specifications used for running the experiments. It describes the types of models trained (from scratch or finetuned from foundation models) and general training parameters, but lacks specific hardware details. |
| Software Dependencies | No | The paper mentions that the code is publicly available, but it does not list the specific software dependencies or library versions used for the experiments. |
| Experiment Setup | Yes | All of our classification models were trained using a standard cross-entropy loss, using the Adam optimiser, with a batch size of 32. For models trained from scratch, we used a cosine learning rate schedule or a fixed learning rate schedule, depending on the architecture and the dataset, chosen based on validation performance; we detail all configurations in Table A.3. For finetuning foundation models, we used a fixed learning rate of 10⁻⁵, chosen based on validation performance. Table A.3: Learning rate schedules used to train models initialised with random weights, per dataset. Learning rate schedules were chosen based on best validation performance. In terms of regularisation strength, we chose to fix the hyperparameters for the strength of label smoothing and entropy regularisation across all datasets and experiments (λ = 0.05 for LS, α = 0.1 for ER, λ = 0.05, α = 0.1 for ER+LS). |
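For readers unfamiliar with the post-hoc calibration methods compared in the paper, the simplest baseline, temperature scaling, can be sketched as follows. This is a generic illustration, not the authors' implementation (which is available in their repository); the grid-search fitting procedure and the value ranges are assumptions made for this example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 softens (de-confidences) predictions."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Negative log-likelihood of the true labels under scaled probabilities."""
    p = softmax(logits, temperature)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the single scalar T minimising validation NLL (simple grid search)."""
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)  # step 0.05, includes T = 1.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))
```

Because temperature scaling fits only one scalar on held-out in-distribution data, it is cheap but cannot adapt to the shifted test distributions studied in the paper, which is exactly the failure mode the compared methods try to address.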
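As a rough illustration of the in-training calibration strategies quoted in the setup above (label smoothing and entropy regularisation), the snippet below sketches both losses in NumPy. The exact formulations used in the paper may differ; the uniform-mixing variant of label smoothing and the default hyperparameters (λ = 0.05, α = 0.1, matching the quoted setup) are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def label_smoothing_ce(logits, labels, lam=0.05):
    """Cross-entropy against one-hot targets mixed with the uniform distribution."""
    n, k = logits.shape
    targets = np.full((n, k), lam / k)
    targets[np.arange(n), labels] += 1.0 - lam
    return -np.mean(np.sum(targets * log_softmax(logits), axis=1))

def entropy_regularised_ce(logits, labels, alpha=0.1):
    """Cross-entropy minus alpha times the predictive entropy,
    penalising overconfident (low-entropy) predictions."""
    log_p = log_softmax(logits)
    ce = -np.mean(log_p[np.arange(len(labels)), labels])
    entropy = -np.mean(np.sum(np.exp(log_p) * log_p, axis=1))
    return ce - alpha * entropy
```

Both losses reduce to plain cross-entropy when their hyperparameter is zero, which is a convenient sanity check when wiring them into a training loop.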