Revisiting the Calibration of Modern Neural Networks
Authors: Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties. |
| Researcher Affiliation | Industry | Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic; Google Research, Brain Team; {mjlm, lucic}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code and a large dataset of calibration measurements, comprising 180 distinct models from 16 families, each evaluated on 79 ImageNet-scale datasets and 28 metric variants. Available at https://github.com/google-research/robustness_metrics/tree/master/robustness_metrics/projects/revisiting_calibration. |
| Open Datasets | Yes | All models are either trained or fine-tuned on the IMAGENET training set, except for CLIP... We evaluate accuracy and calibration on the IMAGENET validation set and the following out-of-distribution benchmarks using the Robustness Metrics library (Djolonga et al., 2020): 1. IMAGENETV2 (Recht et al., 2019) is a new IMAGENET test set... 2. IMAGENET-C (Hendrycks & Dietterich, 2019)... 3. IMAGENET-R (Hendrycks et al., 2020a)... 4. IMAGENET-A (Hendrycks et al., 2021)... |
| Dataset Splits | Yes | For the post-hoc recalibration of models, we reserve 20% of the IMAGENET validation set (randomly sampled) for fitting the temperature scaling parameter. All reported metrics are computed on the remaining 80% of the data. (A sketch of this split-and-fit protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | Throughout the paper, we estimate ECE using equal-mass binning and 100 bins... For the post-hoc recalibration of models, we reserve 20% of the IMAGENET validation set (randomly sampled) for fitting the temperature scaling parameter. (A sketch of the equal-mass ECE estimator appears after the table.) |
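
The Experiment Setup row states that ECE is estimated with equal-mass (equal-count) binning and 100 bins. Below is a minimal Python sketch of such an estimator, assuming top-1 confidences and 0/1 correctness indicators as inputs; the function name `equal_mass_ece` and its signature are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def equal_mass_ece(confidences, correct, num_bins=100):
    """Expected Calibration Error with equal-mass (equal-count) bins.

    confidences: (N,) max softmax probability per example.
    correct: (N,) 1.0 if the top-1 prediction was right, else 0.0.
    """
    order = np.argsort(confidences)
    conf_sorted = confidences[order]
    correct_sorted = correct[order]
    # Split the sorted examples into num_bins groups of (nearly) equal size,
    # so every bin carries roughly the same number of examples.
    bins_conf = np.array_split(conf_sorted, num_bins)
    bins_corr = np.array_split(correct_sorted, num_bins)
    n = len(confidences)
    ece = 0.0
    for bc, br in zip(bins_conf, bins_corr):
        if len(bc) == 0:
            continue
        # |accuracy - mean confidence|, weighted by the bin's share of examples.
        ece += (len(bc) / n) * abs(br.mean() - bc.mean())
    return ece
```

Equal-mass binning avoids the near-empty high- or low-confidence bins that fixed-width binning can produce, which is one common motivation for this estimator.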
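The Dataset Splits row describes the recalibration protocol: a single temperature is fit on a random 20% of the IMAGENET validation set, and metrics are reported on the remaining 80%. The sketch below implements generic temperature scaling under that split, assuming NLL minimization as the fitting objective; the synthetic logits, the helper `fit_temperature`, and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(logits, labels):
    """Fit a single temperature T by minimizing the NLL of softmax(logits / T)."""
    def nll(log_t):
        t = np.exp(log_t)  # optimize log T so that T stays positive
        logp = log_softmax(logits / t, axis=1)
        return -logp[np.arange(len(labels)), labels].mean()
    res = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(res.x))

# Synthetic stand-ins for a model's validation-set outputs (illustrative only).
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(1000, 10))
val_labels = rng.integers(0, 10, size=1000)

# Random 20% / 80% split: fit the temperature on 20%, report metrics on 80%.
perm = rng.permutation(len(val_logits))
fit_idx, eval_idx = perm[: len(perm) // 5], perm[len(perm) // 5:]
temperature = fit_temperature(val_logits[fit_idx], val_labels[fit_idx])
calibrated_logits = val_logits[eval_idx] / temperature  # metrics computed here
```

Because temperature scaling divides all logits by one scalar, it changes confidence but never the argmax, so accuracy on the held-out 80% is unaffected by the recalibration step.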