On Calibration of Modern Neural Networks

Authors: Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling, a single-parameter variant of Platt scaling, is surprisingly effective at calibrating predictions.
Researcher Affiliation | Academia | Cornell University. Correspondence to: Chuan Guo <cg563@cornell.edu>, Geoff Pleiss <geoff@cs.cornell.edu>, Yu Sun <ys646@cornell.edu>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | For image classification we use 6 datasets: 1. Caltech-UCSD Birds (Welinder et al., 2010):... 2. Stanford Cars (Krause et al., 2013):... 3. ImageNet 2012 (Deng et al., 2009):... 4. CIFAR-10/CIFAR-100 (Krizhevsky & Hinton, 2009):... 5. Street View House Numbers (SVHN) (Netzer et al., 2011):... For document classification we experiment with 4 datasets: 1. 20 News:... 2. Reuters:... 3. Stanford Sentiment Treebank (SST) (Socher et al., 2013):...
Dataset Splits | Yes | 1. Caltech-UCSD Birds (Welinder et al., 2010): 200 bird species. 5994/2897/2897 images for train/validation/test sets. 2. Stanford Cars (Krause et al., 2013):... 8041/4020/4020 images for train/validation/test. 3. ImageNet 2012 (Deng et al., 2009):... 1.3 million/25,000/25,000 images for train/validation/test. 4. CIFAR-10/CIFAR-100 (Krizhevsky & Hinton, 2009):... 45,000/5,000/10,000 images for train/validation/test. 5. Street View House Numbers (SVHN) (Netzer et al., 2011):... 604,388/6,000/26,032 images for train/validation/test. 20 News: ...9034/2259/7528 documents for train/validation/test.
Hardware Specification | No | The paper does not provide specific hardware details such as the exact GPU or CPU models used for running the experiments.
Software Dependencies | No | The paper mentions 'Torch7' and the 'authors' code' but does not provide specific version numbers for software dependencies needed for replication.
Experiment Setup | No | We use the data preprocessing, training procedures, and hyperparameters as described in each paper. These networks obtain competitive accuracy using the optimization hyperparameters suggested by the original paper. On SST, we train Tree LSTMs (Long Short-Term Memory) (Tai et al., 2015) using the default settings in the authors' code.
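The temperature-scaling recipe quoted above divides a network's logits by a single scalar T (learned on the validation set by minimizing negative log-likelihood) before the softmax. Since the paper releases no code, the sketch below is only an illustration of the idea in NumPy; the function names and the simple grid search over T (standing in for the authors' NLL optimization) are assumptions, not their implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the true labels under temperature T.
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    # Temperature scaling has a single parameter, so a coarse grid
    # search over T on the validation set suffices for illustration.
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return grid[int(np.argmin(losses))]

# Toy example: artificially inflated logits mimic an overconfident
# network, so the fitted temperature should come out greater than 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)
logits = rng.normal(size=(500, 10))
logits[np.arange(500), labels] += 2.0  # make predictions mostly correct
logits *= 3.0                          # inflate confidence
T = fit_temperature(logits, labels)
```

Note that dividing all logits by the same T > 1 softens the softmax without changing the arg-max, so calibration improves while accuracy is untouched; this is why the assessment calls it a "simple and straightforward recipe".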