Revisiting the Calibration of Modern Neural Networks
Authors: Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties. |
| Researcher Affiliation | Industry | Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic; Google Research, Brain Team; {mjlm, lucic}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide code and a large dataset of calibration measurements, comprising 180 distinct models from 16 families, each evaluated on 79 ImageNet-scale datasets and 28 metric variants. Available at https://github.com/google-research/robustness_metrics/tree/master/robustness_metrics/projects/revisiting_calibration. |
| Open Datasets | Yes | All models are either trained or fine-tuned on the IMAGENET training set, except for CLIP... We evaluate accuracy and calibration on the IMAGENET validation set and the following out-of-distribution benchmarks using the Robustness Metrics library (Djolonga et al., 2020): 1. IMAGENETV2 (Recht et al., 2019) is a new IMAGENET test set... 2. IMAGENET-C (Hendrycks & Dietterich, 2019)... 3. IMAGENET-R (Hendrycks et al., 2020a)... 4. IMAGENET-A (Hendrycks et al., 2021)... |
| Dataset Splits | Yes | For the post-hoc recalibration of models, we reserve 20% of the IMAGENET validation set (randomly sampled) for fitting the temperature scaling parameter. All reported metrics are computed on the remaining 80% of the data. (A sketch of this split-and-fit protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | Throughout the paper, we estimate ECE using equal-mass binning and 100 bins... For the post-hoc recalibration of models, we reserve 20% of the IMAGENET validation set (randomly sampled) for fitting the temperature scaling parameter. (A sketch of the equal-mass ECE estimator appears after the table.) |
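
The Experiment Setup row states that ECE is estimated with equal-mass (equal-count) binning and 100 bins. Below is a minimal Python sketch of such an estimator, assuming top-1 confidences and 0/1 correctness indicators as inputs; the function name `equal_mass_ece` and its signature are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def equal_mass_ece(confidences, correct, num_bins=100):
    """Expected Calibration Error with equal-mass (equal-count) bins.

    confidences: (N,) max softmax probability per example.
    correct: (N,) 1.0 if the top-1 prediction was right, else 0.0.
    """
    order = np.argsort(confidences)
    conf_sorted = confidences[order]
    correct_sorted = correct[order]
    # Split the sorted examples into num_bins groups of (nearly) equal size,
    # so every bin carries roughly the same number of examples.
    bins_conf = np.array_split(conf_sorted, num_bins)
    bins_corr = np.array_split(correct_sorted, num_bins)
    n = len(confidences)
    ece = 0.0
    for bc, br in zip(bins_conf, bins_corr):
        if len(bc) == 0:
            continue
        # |accuracy - mean confidence|, weighted by the bin's share of examples.
        ece += (len(bc) / n) * abs(br.mean() - bc.mean())
    return ece
```

Equal-mass binning avoids the near-empty high- or low-confidence bins that fixed-width binning can produce, which is one common motivation for this estimator.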
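The Dataset Splits row describes the recalibration protocol: a single temperature is fit on a random 20% of the IMAGENET validation set, and metrics are reported on the remaining 80%. The sketch below implements generic temperature scaling under that split, assuming NLL minimization as the fitting objective; the synthetic logits, the helper `fit_temperature`, and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(logits, labels):
    """Fit a single temperature T by minimizing the NLL of softmax(logits / T)."""
    def nll(log_t):
        t = np.exp(log_t)  # optimize log T so that T stays positive
        logp = log_softmax(logits / t, axis=1)
        return -logp[np.arange(len(labels)), labels].mean()
    res = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(res.x))

# Synthetic stand-ins for a model's validation-set outputs (illustrative only).
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(1000, 10))
val_labels = rng.integers(0, 10, size=1000)

# Random 20% / 80% split: fit the temperature on 20%, report metrics on 80%.
perm = rng.permutation(len(val_logits))
fit_idx, eval_idx = perm[: len(perm) // 5], perm[len(perm) // 5:]
temperature = fit_temperature(val_logits[fit_idx], val_labels[fit_idx])
calibrated_logits = val_logits[eval_idx] / temperature  # metrics computed here
```

Because temperature scaling divides all logits by one scalar, it changes confidence but never the argmax, so accuracy on the held-out 80% is unaffected by the recalibration step.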