Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Meta-Calibration: Learning of Model Calibration Using Differentiable Expected Calibration Error
Authors: Ondrej Bohdal, Yongxin Yang, Timothy Hospedales
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that DECE-driven meta-learning can be used to obtain excellent calibration across a variety of benchmarks and models. We experiment with CIFAR-10 and CIFAR-100 benchmarks (Krizhevsky, 2009), SVHN (Netzer et al., 2011) and 20 Newsgroups dataset (Lang, 1995), covering both computer vision and NLP. We show the test ECE, test ECE after temperature scaling (TS) and test error rates in Tables 1, 2 and 3 respectively. Meta-Calibration leads to excellent intrinsic calibration without the need for post-processing (Table 1)... We perform the additional evaluation using ResNet18 on the CIFAR benchmark. |
| Researcher Affiliation | Collaboration | Ondrej Bohdal (The University of Edinburgh); Yongxin Yang (Queen Mary University of London); Timothy Hospedales (The University of Edinburgh; Samsung AI Center Cambridge) |
| Pseudocode | Yes | Algorithm 1 Meta-Calibration |
| Open Source Code | No | The paper states: "We extend the implementation provided by (Mukhoti et al., 2020) to implement and evaluate our meta-learning approach." This indicates the work builds on existing code, but the paper neither states that the authors release their own implementation nor provides a link to it. |
| Open Datasets | Yes | Datasets and settings We experiment with CIFAR-10 and CIFAR-100 benchmarks (Krizhevsky, 2009), SVHN (Netzer et al., 2011) and 20 Newsgroups dataset (Lang, 1995), covering both computer vision and NLP. |
| Dataset Splits | Yes | CIFAR and SVHN models are trained for 350 epochs... 90% of the original training set is used for training and 10% for validation. In the case of meta-learning, we create a further separate meta-validation set that is of size 10% of the original training data, so we directly train with 80% of the original training data. 20 Newsgroups models are trained with Adam optimiser with the default parameters, 128 minibatch size and for 50 epochs. As the final model we select the checkpoint with the best validation accuracy. |
| Hardware Specification | Yes | Table 8: Training times in hours. We used one NVIDIA Titan X for each experiment. |
| Software Dependencies | No | The paper mentions optimizers like Adam (Kingma & Ba, 2015) but does not provide specific version numbers for any software libraries or environments used to implement the methodology. |
| Experiment Setup | Yes | CIFAR and SVHN models are trained for 350 epochs, with a multi-step scheduler that decreases the initial learning rate of 0.1 by a factor of 10 after 150 and 250 epochs. Each model is trained with SGD with momentum of 0.9, weight decay of 0.0005 and minibatch size of 128. ... For DECE, we use M = 15 bins and scaling parameters τa = 100, τb = 0.01. Learnable label smoothing coefficients are optimised using Adam (Kingma & Ba, 2015) optimiser with learning rate of 0.001. ... We use λ = 0.5 in the meta-objective... |
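The experiment-setup row above mentions the key DECE hyperparameters (M = 15 bins and scaling parameters τa = 100, τb = 0.01). For intuition, the following is a minimal sketch of how a soft-binned, differentiable approximation of ECE might look. It is not the paper's exact DECE formulation: the roles assigned here to `tau_a` (sharpening the softmax into a soft 0/1 accuracy indicator) and `tau_b` (slope of the sigmoid bin boundaries) are assumptions, and NumPy is used only for readability; in practice this would be written in an autodiff framework so it can serve as a meta-objective.

```python
import numpy as np

def soft_ece(logits, labels, num_bins=15, tau_a=100.0, tau_b=0.01):
    """Illustrative soft-binned ECE approximation (sketch, not the paper's DECE).

    Hard bin assignments and the 0/1 accuracy indicator make standard ECE
    non-differentiable; both are replaced here with smooth surrogates.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    probs = softmax(logits)
    conf = probs.max(axis=1)                       # per-example confidence
    # Sharpened softmax mass on the true class approximates the 0/1
    # accuracy indicator while remaining smooth (assumed role of tau_a).
    sharp = softmax(logits * tau_a)
    correct = sharp[np.arange(len(labels)), labels]
    # Soft bin membership: product of sigmoids around each bin's edges,
    # with slope 1 / tau_b (assumed role of tau_b).
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    lo, hi = edges[:-1], edges[1:]
    member = (sigmoid((conf[:, None] - lo) / tau_b)
              * sigmoid((hi - conf[:, None]) / tau_b))
    weight = member.sum(axis=0) + 1e-8             # soft per-bin counts
    bin_conf = (member * conf[:, None]).sum(axis=0) / weight
    bin_acc = (member * correct[:, None]).sum(axis=0) / weight
    # Weighted |accuracy - confidence| gap per bin, as in standard ECE.
    return float(((weight / len(labels)) * np.abs(bin_acc - bin_conf)).sum())
```

With confident, correct predictions the soft ECE is near zero, while confident but wrong predictions drive it up; because every operation is smooth, the same quantity could be minimised by gradient descent as part of a meta-objective.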