Verified Uncertainty Calibration

Authors: Ananya Kumar, Percy S. Liang, Tengyu Ma

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration.
Researcher Affiliation | Academia | Ananya Kumar, Percy Liang, Tengyu Ma, Department of Computer Science, Stanford University
Pseudocode | No | The paper describes the steps of its algorithm in prose in Section 4.1 ('Algorithm'), but does not provide a formally structured pseudocode block or algorithm figure.
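As noted, Section 4.1 gives the algorithm only in prose. The scaling-binning calibrator it describes — fit a parametric scaling function on the recalibration set, form equal-mass bins from its outputs, then replace each output with its bin's average — could be sketched roughly as follows. This is a minimal NumPy sketch under my own reading of the paper, not the authors' implementation; `fit_platt` and all names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    # Hypothetical helper: fit g(z) = sigmoid(a*z + b) by plain
    # gradient descent on the logistic loss (a crude Platt scaler).
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        grad = p - labels              # d(loss)/d(logit)
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

def scaling_binning(scores, labels, num_bins):
    """Sketch of the scaling-binning idea from Section 4.1:
    1) fit a scaling function g (here a sigmoid) on the recalibration set,
    2) choose bin edges so each bin holds an equal mass of g-values,
    3) output the average g-value in each bin as the calibrated probability."""
    a, b = fit_platt(scores, labels)
    g = sigmoid(a * scores + b)
    # Equal-mass bin edges from the empirical quantiles of g.
    edges = np.quantile(g, np.linspace(0, 1, num_bins + 1))[1:-1]
    bin_ids = np.digitize(g, edges)
    # The discrete output for each bin is the mean of g inside it.
    bin_values = np.array([g[bin_ids == k].mean() for k in range(num_bins)])

    def calibrator(new_scores):
        gz = sigmoid(a * new_scores + b)
        return bin_values[np.digitize(gz, edges)]

    return calibrator
```

Because the final output is piecewise constant over the bins, the calibrated probabilities take at most `num_bins` distinct values, which is what makes the calibration error measurable (the point of the paper's guarantee).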
Open Source Code | Yes | We implement all these methods in a Python library: . All code, data, and experiments can be found on CodaLab at . Updated code can be found at
Open Datasets | Yes | We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet... For ImageNet, we started with a trained VGG16 model... For CIFAR-10, ... [16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
Dataset Splits | Yes | For ImageNet, we split the validation set into 3 sets of sizes (20000, 5000, 25000)... The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. ... We split the validation set of size 10,000 into two sets SC and SE of sizes 3,000 and 7,000 respectively.
Hardware Specification | No | No specific hardware details such as GPU or CPU models, memory amounts, or cloud instance types are provided for the experiments.
Software Dependencies | No | The paper mentions implementing methods in a Python library and cites Keras and TensorFlow, but does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We recalibrated each class separately as in [13], using B bins per class, and evaluated calibration using the marginal calibration error (Definition 2.4). We describe our experimental protocol for CIFAR-10. The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. We ran either the scaling-binning calibrator (we fit a sigmoid in the function fitting step) or histogram binning and measured the marginal calibration error on the entire set of 10,000 points. We repeated this entire procedure 100 times and computed mean and 90% confidence intervals, and we repeated this varying the number of bins B.
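The resampling protocol quoted above — fit on a bootstrap recalibration set, score on the full validation set, repeat 100 times, report mean and a 90% interval — could be sketched as below. This is a hedged reconstruction of the described procedure, not the authors' code; `recalibrate` and `measure_error` are stand-ins for the calibrator being tested and the marginal calibration error:

```python
import numpy as np

def bootstrap_calibration_error(probs, labels, recalibrate, measure_error,
                                resample_size=1000, trials=100, seed=0):
    """Sketch of the quoted protocol: repeatedly sample a recalibration set
    with replacement, fit the calibrator on it, measure the error on the
    full validation set, and report the mean and a 90% interval."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=resample_size, replace=True)
        cal = recalibrate(probs[idx], labels[idx])        # fit on the resample
        errors.append(measure_error(cal(probs), labels))  # evaluate on all points
    errors = np.array(errors)
    lo, hi = np.quantile(errors, [0.05, 0.95])            # empirical 90% interval
    return errors.mean(), (lo, hi)
```

Repeating the whole fit-and-evaluate loop, rather than evaluating one fit many times, is what makes the reported intervals reflect the variance of the recalibration step itself.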