Verified Uncertainty Calibration

Authors: Ananya Kumar, Percy S. Liang, Tengyu Ma

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration.
Researcher Affiliation | Academia | Ananya Kumar, Percy Liang, Tengyu Ma, Department of Computer Science, Stanford University
Pseudocode | No | The paper describes the steps of its algorithm in prose in Section 4.1 ('Algorithm'), but does not provide a formally structured pseudocode block or algorithm figure.
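As noted, Section 4.1 gives the algorithm only in prose. The scaling-binning calibrator it describes — fit a parametric scaling function on the recalibration set, form equal-mass bins from its outputs, then replace each output with its bin's average — could be sketched roughly as follows. This is a minimal NumPy sketch under my own reading of the paper, not the authors' implementation; `fit_platt` and all names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    # Hypothetical helper: fit g(z) = sigmoid(a*z + b) by plain
    # gradient descent on the logistic loss (a crude Platt scaler).
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        grad = p - labels              # d(loss)/d(logit)
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

def scaling_binning(scores, labels, num_bins):
    """Sketch of the scaling-binning idea from Section 4.1:
    1) fit a scaling function g (here a sigmoid) on the recalibration set,
    2) choose bin edges so each bin holds an equal mass of g-values,
    3) output the average g-value in each bin as the calibrated probability."""
    a, b = fit_platt(scores, labels)
    g = sigmoid(a * scores + b)
    # Equal-mass bin edges from the empirical quantiles of g.
    edges = np.quantile(g, np.linspace(0, 1, num_bins + 1))[1:-1]
    bin_ids = np.digitize(g, edges)
    # The discrete output for each bin is the mean of g inside it.
    bin_values = np.array([g[bin_ids == k].mean() for k in range(num_bins)])

    def calibrator(new_scores):
        gz = sigmoid(a * new_scores + b)
        return bin_values[np.digitize(gz, edges)]

    return calibrator
```

Because the final output is piecewise constant over the bins, the calibrated probabilities take at most `num_bins` distinct values, which is what makes the calibration error measurable (the point of the paper's guarantee).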
Open Source Code | Yes | We implement all these methods in a Python library: . All code, data, and experiments can be found on CodaLab at . Updated code can be found at
Open Datasets | Yes | We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet... For ImageNet, we started with a trained VGG16 model... For CIFAR-10, ... [16] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
Dataset Splits | Yes | For ImageNet, we split the validation set into 3 sets of sizes (20000, 5000, 25000)... The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. ... We split the validation set of size 10,000 into two sets SC and SE of sizes 3,000 and 7,000 respectively.
Hardware Specification | No | No specific hardware details such as GPU or CPU models, memory amounts, or cloud instance types are provided for the experiments.
Software Dependencies | No | The paper mentions implementing methods in a Python library and cites Keras and TensorFlow, but does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We recalibrated each class separately as in [13], using B bins per class, and evaluated calibration using the marginal calibration error (Definition 2.4). We describe our experimental protocol for CIFAR-10. The CIFAR-10 validation set has 10,000 data points. We sampled, with replacement, a recalibration set of 1,000 points. We ran either the scaling-binning calibrator (we fit a sigmoid in the function fitting step) or histogram binning and measured the marginal calibration error on the entire set of 10,000 points. We repeated this entire procedure 100 times and computed mean and 90% confidence intervals, and we repeated this varying the number of bins B.
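The resampling protocol quoted above — fit on a bootstrap recalibration set, score on the full validation set, repeat 100 times, report mean and a 90% interval — could be sketched as below. This is a hedged reconstruction of the described procedure, not the authors' code; `recalibrate` and `measure_error` are stand-ins for the calibrator being tested and the marginal calibration error:

```python
import numpy as np

def bootstrap_calibration_error(probs, labels, recalibrate, measure_error,
                                resample_size=1000, trials=100, seed=0):
    """Sketch of the quoted protocol: repeatedly sample a recalibration set
    with replacement, fit the calibrator on it, measure the error on the
    full validation set, and report the mean and a 90% interval."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=resample_size, replace=True)
        cal = recalibrate(probs[idx], labels[idx])        # fit on the resample
        errors.append(measure_error(cal(probs), labels))  # evaluate on all points
    errors = np.array(errors)
    lo, hi = np.quantile(errors, [0.05, 0.95])            # empirical 90% interval
    return errors.mean(), (lo, hi)
```

Repeating the whole fit-and-evaluate loop, rather than evaluating one fit many times, is what makes the reported intervals reflect the variance of the recalibration step itself.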