Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Meta-Calibration: Learning of Model Calibration Using Differentiable Expected Calibration Error
Authors: Ondrej Bohdal, Yongxin Yang, Timothy Hospedales
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that DECE-driven meta-learning can be used to obtain excellent calibration across a variety of benchmarks and models. We experiment with CIFAR-10 and CIFAR-100 benchmarks (Krizhevsky, 2009), SVHN (Netzer et al., 2011) and 20 Newsgroups dataset (Lang, 1995), covering both computer vision and NLP. We show the test ECE, test ECE after temperature scaling (TS) and test error rates in Tables 1, 2 and 3 respectively. Meta-Calibration leads to excellent intrinsic calibration without the need for post-processing (Table 1)... We perform the additional evaluation using ResNet18 on the CIFAR benchmark. |
| Researcher Affiliation | Collaboration | Ondrej Bohdal (The University of Edinburgh); Yongxin Yang (Queen Mary University of London); Timothy Hospedales (The University of Edinburgh; Samsung AI Center Cambridge) |
| Pseudocode | Yes | Algorithm 1 Meta-Calibration |
| Open Source Code | No | The paper states: "We extend the implementation provided by (Mukhoti et al., 2020) to implement and evaluate our meta-learning approach." This indicates the work builds on existing code, but the paper neither states that the authors release their own implementation nor provides a link to it. |
| Open Datasets | Yes | Datasets and settings We experiment with CIFAR-10 and CIFAR-100 benchmarks (Krizhevsky, 2009), SVHN (Netzer et al., 2011) and 20 Newsgroups dataset (Lang, 1995), covering both computer vision and NLP. |
| Dataset Splits | Yes | CIFAR and SVHN models are trained for 350 epochs... 90% of the original training set is used for training and 10% for validation. In the case of meta-learning, we create a further separate meta-validation set that is of size 10% of the original training data, so we directly train with 80% of the original training data. 20 Newsgroups models are trained with Adam optimiser with the default parameters, 128 minibatch size and for 50 epochs. As the final model we select the checkpoint with the best validation accuracy. |
| Hardware Specification | Yes | Table 8: Training times in hours. We used one NVIDIA Titan X for each experiment. |
| Software Dependencies | No | The paper mentions optimizers like Adam (Kingma & Ba, 2015) but does not provide specific version numbers for any software libraries or environments used to implement the methodology. |
| Experiment Setup | Yes | CIFAR and SVHN models are trained for 350 epochs, with a multi-step scheduler that decreases the initial learning rate of 0.1 by a factor of 10 after 150 and 250 epochs. Each model is trained with SGD with momentum of 0.9, weight decay of 0.0005 and minibatch size of 128. ... For DECE, we use M = 15 bins and scaling parameters τa = 100, τb = 0.01. Learnable label smoothing coefficients are optimised using Adam (Kingma & Ba, 2015) optimiser with learning rate of 0.001. ... We use λ = 0.5 in the meta-objective... |
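The experiment-setup row above mentions the key DECE hyperparameters (M = 15 bins and scaling parameters τa = 100, τb = 0.01). For intuition, the following is a minimal sketch of how a soft-binned, differentiable approximation of ECE might look. It is not the paper's exact DECE formulation: the roles assigned here to `tau_a` (sharpening the softmax into a soft 0/1 accuracy indicator) and `tau_b` (slope of the sigmoid bin boundaries) are assumptions, and NumPy is used only for readability; in practice this would be written in an autodiff framework so it can serve as a meta-objective.

```python
import numpy as np

def soft_ece(logits, labels, num_bins=15, tau_a=100.0, tau_b=0.01):
    """Illustrative soft-binned ECE approximation (sketch, not the paper's DECE).

    Hard bin assignments and the 0/1 accuracy indicator make standard ECE
    non-differentiable; both are replaced here with smooth surrogates.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    probs = softmax(logits)
    conf = probs.max(axis=1)                       # per-example confidence
    # Sharpened softmax mass on the true class approximates the 0/1
    # accuracy indicator while remaining smooth (assumed role of tau_a).
    sharp = softmax(logits * tau_a)
    correct = sharp[np.arange(len(labels)), labels]
    # Soft bin membership: product of sigmoids around each bin's edges,
    # with slope 1 / tau_b (assumed role of tau_b).
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    lo, hi = edges[:-1], edges[1:]
    member = (sigmoid((conf[:, None] - lo) / tau_b)
              * sigmoid((hi - conf[:, None]) / tau_b))
    weight = member.sum(axis=0) + 1e-8             # soft per-bin counts
    bin_conf = (member * conf[:, None]).sum(axis=0) / weight
    bin_acc = (member * correct[:, None]).sum(axis=0) / weight
    # Weighted |accuracy - confidence| gap per bin, as in standard ECE.
    return float(((weight / len(labels)) * np.abs(bin_acc - bin_conf)).sum())
```

With confident, correct predictions the soft ECE is near zero, while confident but wrong predictions drive it up; because every operation is smooth, the same quantity could be minimised by gradient descent as part of a meta-objective.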