Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Meta-Cal: Well-controlled Post-hoc Calibration by Ranking
Authors: Xingchen Ma, Matthew B. Blaschko
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on CIFAR-10, CIFAR-100 and ImageNet and a range of popular network architectures show our proposed method significantly outperforms the current state of the art for post-hoc multi-class classification calibration. |
| Researcher Affiliation | Academia | ESAT-PSI, KU Leuven, Belgium. Correspondence to: Xingchen Ma <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Meta-Cal (miscoverage control) and Algorithm 2 Meta-Cal (coverage accuracy control). |
| Open Source Code | Yes | Code is available at https://github.com/maxc01/metacal |
| Open Datasets | Yes | For CIFAR-10 and CIFAR-100, the following networks are used: DenseNet (Huang et al., 2016a), ResNet (He et al., 2015), ResNet with stochastic depth (Huang et al., 2016b), Wide ResNet (Zagoruyko & Komodakis, 2016). 45000 out of 60000 images are used for training these classifiers. The remaining 15000 images are held out for training and evaluating post-hoc calibration methods. For ImageNet, we use pre-trained DenseNet-161 and ResNet-152 from PyTorch (Paszke et al., 2019). |
| Dataset Splits | Yes | The remaining 15000 images are held out for training and evaluating post-hoc calibration methods. The training details are given in Supplement C. These 15000 samples are randomly split into 5000/10000 samples to train and evaluate a post-hoc calibration method. For ImageNet, we use pre-trained DenseNet-161 and ResNet-152 from PyTorch (Paszke et al., 2019). 50000 images in the validation set are used for training and evaluating post-hoc calibration methods. To train and test a calibration map, we randomly split these samples into 25000/25000 images. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using PyTorch but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The experimental configurations specific to our proposed approach are as follows. For Meta-Cal under the miscoverage rate constraint, we set the miscoverage rate tolerance to be 0.05 for all neural network classifiers and all data sets used in the experiments. For Meta-Cal under the coverage accuracy constraint, we set the desired coverage accuracy to be 0.97, 0.87, 0.85 for CIFAR-10, CIFAR-100 and ImageNet, respectively. In both settings, we randomly select 1/10 samples (up to 500 samples) from the calibration data set to construct a binary classifier or estimate the coverage accuracy transformation function. |
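
The split sizes and subset rule quoted in the Dataset Splits and Experiment Setup rows can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `split_calibration_pool` and `select_subset`, the seed handling, and the reading of "1/10 samples (up to 500 samples)" as `min(n // 10, 500)` are all assumptions made here.

```python
import random


def split_calibration_pool(n_total, n_train, seed=0):
    """Randomly partition indices 0..n_total-1 into a calibration-training
    part of size n_train and an evaluation part holding the rest.
    (Hypothetical helper; the seed is an assumption for repeatability.)"""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


def select_subset(train_indices, fraction=0.1, cap=500, seed=0):
    """Draw a random subset without replacement.
    Assumed reading of the quoted rule: take `fraction` of the samples,
    but never more than `cap`."""
    rng = random.Random(seed)
    k = min(int(len(train_indices) * fraction), cap)
    return rng.sample(train_indices, k)


# CIFAR-10 / CIFAR-100: 15000 held-out images split into 5000 / 10000.
cifar_train, cifar_eval = split_calibration_pool(15000, 5000)
cifar_subset = select_subset(cifar_train)

# ImageNet: 50000 validation images split into 25000 / 25000.
imagenet_train, imagenet_eval = split_calibration_pool(50000, 25000)
imagenet_subset = select_subset(imagenet_train)
```

Under this reading, both settings end up with 500 selected samples: one tenth of 5000 is exactly 500, and one tenth of 25000 is capped at 500.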