Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Top-label calibration and multiclass-to-binary reductions
Authors: Chirag Gupta, Aaditya Ramdas
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an empirical evaluation with four deep net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling. |
| Researcher Affiliation | Academia | Chirag Gupta & Aaditya Ramdas Carnegie Mellon University EMAIL |
| Pseudocode | Yes | Algorithm 1: Confidence calibrator, Algorithm 2: Top-label calibrator, Algorithm 3: Class-wise calibrator, Algorithm 4: Normalized calibrator, Algorithm 5: Post-hoc calibrator for a given M2B calibration notion C, Algorithm 6: Top-K-label calibrator, Algorithm 7: Top-K-confidence calibrator, Algorithm 8: Top-label histogram binning, Algorithm 9: Class-wise histogram binning |
| Open Source Code | Yes | Code for this work is available at https://github.com/aigen/df-posthoc-calibration. |
| Open Datasets | Yes | We experimented on the CIFAR-10 and CIFAR-100 datasets |
| Dataset Splits | Yes | Both CIFAR datasets consist of 60K (60,000) points, which are split as 45K/5K/10K to form the train/validation/test sets. |
| Hardware Specification | No | This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Towns et al., 2014). Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). This provides names of computing resources, not specific hardware components like GPU/CPU models or memory, making it not reproducible in terms of specific hardware. |
| Software Dependencies | No | We also used the code at https://github.com/torrvision/focal_calibration for temperature scaling (TS). For vector scaling (VS) and Dirichlet scaling (DS), we used the code of Kull et al. (2019), hosted at https://github.com/dirichletcal/dirichlet_python. This mentions software by name and URL, but does not provide specific version numbers. |
| Experiment Setup | Yes | No hyperparameter tuning was performed in any of our histogram binning experiments or baseline experiments; default settings were used in every case. The random seed was fixed so that every run of the experiment gives the same result. Hyperparameter: # points per bin k ∈ ℕ (say 50), tie-breaking parameter δ > 0 (say 10^-10). |
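
The two hyperparameters quoted in the Experiment Setup row (points per bin k and tie-breaking parameter δ) can be illustrated with a minimal sketch of top-label histogram binning. This is not the authors' implementation: the function names, the uniform-jitter form of the tie-breaking, and the equal-count bin construction are assumptions made for illustration. The code in the linked repository is the authoritative version.

```python
import bisect
import random


def fit_uniform_mass_bins(scores, labels, points_per_bin=50, delta=1e-10, seed=0):
    """Fit binary histogram binning with roughly `points_per_bin` points per bin.

    Hypothetical helper: returns (edges, values), where `edges` are the upper
    boundaries of all bins except the last and `values` are per-bin label means.
    """
    if not scores:
        raise ValueError("need at least one calibration point")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same bins
    # Assumed tie-breaking: add a tiny random perturbation of size at most delta.
    jittered = [s + rng.uniform(0.0, delta) for s in scores]
    order = sorted(range(len(jittered)), key=jittered.__getitem__)
    n_bins = max(1, len(order) // points_per_bin)
    edges, values = [], []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        members = order[lo:hi]
        # Calibrated output for this bin: empirical frequency of label 1.
        values.append(sum(labels[i] for i in members) / len(members))
        if b < n_bins - 1:
            edges.append(jittered[members[-1]])
    return edges, values


def predict_bin_value(edges, values, score):
    """Return the calibrated value of the bin that `score` falls into."""
    return values[bisect.bisect_left(edges, score)]


def fit_top_label(probs, y, points_per_bin=50, delta=1e-10, seed=0):
    """Fit one binary calibrator per predicted class (top-label reduction).

    For each class c, only points whose predicted class is c are used, with
    the binary label 1 if the true class equals c and 0 otherwise.
    """
    by_class = {}
    for p, label in zip(probs, y):
        pred = max(range(len(p)), key=p.__getitem__)
        conf_list, hit_list = by_class.setdefault(pred, ([], []))
        conf_list.append(p[pred])
        hit_list.append(1.0 if label == pred else 0.0)
    return {
        c: fit_uniform_mass_bins(s, h, points_per_bin, delta, seed)
        for c, (s, h) in by_class.items()
    }
```

At prediction time, one would look up the calibrator for the predicted class and pass the top confidence to `predict_bin_value`. Classes that never appear as a prediction in the calibration set have no calibrator in this sketch, so a caller needs a fallback for them.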