Top-label calibration and multiclass-to-binary reductions

Authors: Chirag Gupta, Aaditya Ramdas

ICLR 2022

Reproducibility assessment. Each entry below gives the variable, the extracted result, and the supporting LLM response.

Research Type: Experimental
LLM Response: "In an empirical evaluation with four deep net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling."

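To make the evaluated notion concrete, here is a minimal sketch of a binned top-label calibration-error estimate: it conditions on the predicted top label, as the top-label notion requires, and compares average confidence to empirical accuracy within each (class, confidence-bin) cell. This is a simplified plug-in estimator for illustration, not the authors' evaluation code; all names are ours.

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=15):
    """Simplified binned estimate of top-label calibration error.

    probs: (n, k) array of predicted class probabilities.
    labels: (n,) array of true class indices.
    """
    preds = probs.argmax(axis=1)           # predicted top label
    conf = probs.max(axis=1)               # top-label confidence
    correct = (preds == labels).astype(float)
    n, err = len(labels), 0.0
    for c in np.unique(preds):             # condition on the predicted class
        conf_c, correct_c = conf[preds == c], correct[preds == c]
        edges = np.linspace(conf_c.min(), conf_c.max(), n_bins + 1)
        ids = np.clip(np.searchsorted(edges, conf_c, side="right") - 1, 0, n_bins - 1)
        for b in range(n_bins):
            cell = ids == b
            if cell.any():                 # weight each cell by its share of the data
                err += cell.sum() / n * abs(conf_c[cell].mean() - correct_c[cell].mean())
    return err
```
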
Researcher Affiliation: Academia
LLM Response: Chirag Gupta and Aaditya Ramdas, Carnegie Mellon University, {chiragg,aramdas}@cmu.edu

Pseudocode: Yes
LLM Response: Algorithm 1: Confidence calibrator; Algorithm 2: Top-label calibrator; Algorithm 3: Class-wise calibrator; Algorithm 4: Normalized calibrator; Algorithm 5: Post-hoc calibrator for a given M2B calibration notion C; Algorithm 6: Top-K-label calibrator; Algorithm 7: Top-K-confidence calibrator; Algorithm 8: Top-label histogram binning; Algorithm 9: Class-wise histogram binning.

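Of these, Algorithm 8 (top-label histogram binning) is the core M2B + HB procedure evaluated above. The sketch below illustrates the idea under our own naming: reduce the multiclass problem to one binary problem per predicted top label, then fit equal-mass histogram binning within each class. It is an illustration of the reduction, not the paper's exact pseudocode (the tie-breaking detail is treated separately under Experiment Setup below).

```python
import numpy as np

def fit_binary_hb(scores, outcomes, points_per_bin=50):
    """Equal-mass histogram binning on a binary problem: bin edges are
    empirical quantiles; each bin predicts its empirical positive rate."""
    n_bins = max(1, len(scores) // points_per_bin)
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    ids = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_bins - 1)
    means = np.array([outcomes[ids == b].mean() if (ids == b).any() else 0.5
                      for b in range(n_bins)])
    return edges, means

def fit_top_label_hb(probs, labels, points_per_bin=50):
    """M2B reduction: one binary histogram-binning calibrator per predicted
    top label, fit on the points whose top label is that class."""
    preds, conf = probs.argmax(axis=1), probs.max(axis=1)
    return {c: fit_binary_hb(conf[preds == c],
                             (labels[preds == c] == c).astype(float),
                             points_per_bin)
            for c in np.unique(preds)}

def predict_top_label_hb(probs, calibrators):
    """Return (top label, calibrated top-label probability) per point.
    Assumes every predicted class appeared at calibration time."""
    preds, conf = probs.argmax(axis=1), probs.max(axis=1)
    out = np.empty_like(conf)
    for c, (edges, means) in calibrators.items():
        mask = preds == c
        ids = np.clip(np.searchsorted(edges, conf[mask], side="right") - 1,
                      0, len(means) - 1)
        out[mask] = means[ids]
    return preds, out
```
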
Open Source Code: Yes
LLM Response: "Code for this work is available at https://github.com/aigen/df-posthoc-calibration."

Open Datasets: Yes
LLM Response: "We experimented on the CIFAR-10 and CIFAR-100 datasets."

Dataset Splits: Yes
LLM Response: "Both CIFAR datasets consist of 60K (60,000) points, which are split as 45K/5K/10K to form the train/validation/test sets."

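For concreteness, one plausible way to realize this 45K/5K/10K split with torchvision (an assumption on our part; the paper does not state which library or which indices were used):

```python
import torch
from torchvision import datasets, transforms

# CIFAR-10 ships as a 50K train / 10K test split; carve 5K validation
# points out of the training portion to obtain 45K/5K/10K.
full_train = datasets.CIFAR10(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
test_set = datasets.CIFAR10(root="data", train=False, download=True,
                            transform=transforms.ToTensor())

generator = torch.Generator().manual_seed(0)   # fixed seed for reproducibility
train_set, val_set = torch.utils.data.random_split(full_train, [45_000, 5_000],
                                                   generator=generator)
print(len(train_set), len(val_set), len(test_set))   # 45000 5000 10000
```
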
Hardware Specification: No
LLM Response: "This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Towns et al., 2014). Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC)." This names the computing facility but not specific hardware components such as GPU/CPU models or memory, so the hardware setup is not reproducible.

Software Dependencies: No
LLM Response: "We also used the code at https://github.com/torrvision/focal_calibration for temperature scaling (TS). For vector scaling (VS) and Dirichlet scaling (DS), we used the code of Kull et al. (2019), hosted at https://github.com/dirichletcal/dirichlet_python." The software is named by URL, but no version numbers are given.

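As a reference point for the TS baseline, here is a generic temperature-scaling sketch (ours, not the code from either repository): a single temperature T > 0 is fit on validation logits by minimizing the negative log-likelihood.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Fit a single temperature T by minimizing NLL on held-out logits.
    Parametrized as T = exp(log_t) to keep the temperature positive."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels); calibrated
# probabilities are softmax(test_logits / T).
```
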
Experiment Setup: Yes
LLM Response: "No hyperparameter tuning was performed in any of our histogram binning experiments or baseline experiments; default settings were used in every case. The random seed was fixed so that every run of the experiment gives the same result. Hyperparameters: number of points per bin k ∈ ℕ (say 50), tie-breaking parameter δ > 0 (say 10^-10)."

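These two hyperparameters map directly onto the histogram-binning sketch above. One standard realization of a tie-breaking parameter δ, shown here under a fixed seed as described, is to jitter scores by at most δ so that empirical quantiles, and hence equal-mass bin edges, remain distinct even when many scores coincide; this may not be the paper's exact rule from Algorithm 8.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: repeated runs give identical results

K_PER_BIN = 50     # points per bin k (the suggested default)
DELTA = 1e-10      # tie-breaking parameter delta

def break_ties(scores, delta=DELTA, rng=rng):
    """Jitter scores by at most delta so that quantile-based bin edges are
    distinct; one standard tie-breaking device, not necessarily the paper's."""
    return scores + rng.uniform(0.0, delta, size=scores.shape)

scores = rng.beta(5, 2, size=1_000)           # stand-in confidence scores
jittered = break_ties(scores)
n_bins = len(jittered) // K_PER_BIN           # 20 equal-mass bins
edges = np.quantile(jittered, np.linspace(0, 1, n_bins + 1))
```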