ESD: Expected Squared Difference as a Tuning-Free Trainable Calibration Measure

Authors: Hee Suk Yoon, Joshua Tian Jin Tee, Eunseop Yoon, Sunjae Yoon, Gwangsu Kim, Yingzhen Li, Chang D. Yoo

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With extensive experiments on several architectures (CNNs, Transformers) and datasets, we demonstrate that (1) incorporating ESD into training improves model calibration in various batch-size settings without the need for internal hyperparameter tuning, (2) ESD yields the best-calibrated results compared with previous approaches, and (3) ESD drastically reduces the computational cost required for calibration during training due to the absence of internal hyperparameters.
Researcher Affiliation | Academia | Korea Advanced Institute of Science and Technology (KAIST); Imperial College London
Pseudocode | Yes | Appendix C (Expected Squared Difference (ESD) Pseudocode): In this section, we provide pseudocode for calculating the Expected Squared Difference (ESD) for a given batch output. Algorithm 1: PyTorch-like pseudocode for Expected Squared Difference (ESD).
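The paper's own Algorithm 1 is not reproduced on this page, so the following is only a minimal PyTorch sketch of what a differentiable, batch-level ESD-style statistic could look like: each sample's confidence serves as a threshold, and the squared mean of (accuracy minus confidence) over samples above that threshold is estimated without diagonal terms and then averaged over thresholds. The function name `esd_loss` and this particular estimator are assumptions of the sketch, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

def esd_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Illustrative batch estimator of an ESD-style calibration statistic.

    For each sample i, its confidence c_i is used as a threshold t; the
    squared mean of (accuracy - confidence) over samples with c_j >= t is
    estimated without diagonal terms, then averaged over all thresholds.
    """
    n = logits.size(0)
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)          # confidence and predicted class
    acc = (pred == labels).float()         # 1 if correct, else 0
    d = acc - conf                         # per-sample calibration residual

    # mask[i, j] = 1 if c_j >= c_i (sample j passes threshold t = c_i)
    mask = (conf.unsqueeze(0) >= conf.unsqueeze(1)).float()

    masked_d = d.unsqueeze(0) * mask       # residuals above each threshold
    sum_d = masked_d.sum(dim=1)            # sum_j d_j * 1(c_j >= t_i)
    sum_d2 = (masked_d ** 2).sum(dim=1)    # sum_j d_j^2 * 1(c_j >= t_i)

    # Estimate of (E[d * 1(c >= t_i)])^2 from a finite batch, dropping the
    # diagonal j = k terms: ((sum d)^2 - sum d^2) / (N * (N - 1)).
    per_threshold = (sum_d ** 2 - sum_d2) / (n * (n - 1))
    return per_threshold.mean()
```

In interleaved training, such a term would typically be added to the NLL objective weighted by λ, as described in the Experiment Setup row below.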
Open Source Code | Yes | The code is publicly accessible at https://github.com/heesuk-yoon/ESD.
Open Datasets | Yes | MNIST (Deng, 2012)...CIFAR10 & CIFAR100 (Krizhevsky et al., a;b)...ImageNet100 (Deng et al., 2009)...SNLI (Bowman et al., 2015)...ANLI (Nie et al., 2020)
Dataset Splits | Yes | MNIST (Deng, 2012): 54,000/6,000/10,000 images for the train, validation, and test splits. ... CIFAR10 & CIFAR100 (Krizhevsky et al., a;b): 45,000/5,000/10,000 images for the train, validation, and test splits. ... ImageNet100 (Deng et al., 2009): ...117,000/13,000/5,000 split for the train/val/test sets. ... SNLI (Bowman et al., 2015): ...550,152/10,000/10,000 sentence pairs for the train/val/test sets, respectively. ... ANLI (Nie et al., 2020): ...162,865/3,200/3,200 sentence pairs for the train/val/test sets, respectively.
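The validation splits above are carved out of each official training set (e.g., 45,000/5,000 for CIFAR10). As a rough illustration, such a split could be reproduced with torchvision and torch.utils.data as sketched below; the random seed, download path, and transform are illustrative assumptions, not values reported by the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hypothetical reconstruction of the CIFAR10 45,000/5,000/10,000 split;
# the generator seed, data root, and transform are illustrative assumptions.
transform = transforms.ToTensor()
full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_set, val_set = random_split(
    full_train, [45_000, 5_000], generator=torch.Generator().manual_seed(0)
)
print(len(train_set), len(val_set), len(test_set))  # 45000 5000 10000
```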
Hardware Specification | Yes | All experiments were done using NVIDIA Quadro RTX 8000 and NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions optimizers like AdamW and models like BERT-base and RoBERTa-base, and the pseudocode is described as 'PyTorch-like', but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For the image classification tasks we use the AdamW (Loshchilov & Hutter, 2019) optimizer with a 10^-3 learning rate and 10^-2 weight decay for 250 epochs, except for ImageNet100, in which case we used 10^-4 weight decay for 90 epochs. For the NLI tasks, we use the AdamW optimizer with a 10^-5 learning rate and 10^-2 weight decay for 15 epochs. For both tasks, we use a batch size of 512. ... For the interleaved training settings, we held out 10% of the train set as the calibration set. The regularizer hyperparameter λ for weighting the calibration measure with respect to NLL is chosen via fixed grid search. For measuring calibration error, we use ECE with 20 equally sized bins. ... For ϕ, we search [0.2, 0.4, 0.6, 0.8]. For T, we search [0.0001, 0.001, 0.01, 0.1].
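Putting the reported setup together, below is a hedged sketch of the training objective (NLL plus a λ-weighted calibration term, here reusing the hypothetical `esd_loss` sketch from above), the AdamW configuration for the image-classification tasks, and a 20-bin ECE. The placeholder model, the λ value, and the use of equal-width bins (the paper only says "equally sized") are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(784, 10)  # placeholder classifier; the paper uses CNNs/Transformers
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
lam = 0.1  # λ is chosen by grid search in the paper; 0.1 is only a placeholder

def training_step(logits, labels):
    # Total loss = NLL + λ * calibration term (the esd_loss sketch from above).
    return F.cross_entropy(logits, labels) + lam * esd_loss(logits, labels)

@torch.no_grad()
def ece(logits, labels, n_bins: int = 20):
    """Expected Calibration Error with 20 bins (equal-width bins assumed here)."""
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    acc = (pred == labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    err = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()
            err = err + weight * (acc[in_bin].mean() - conf[in_bin].mean()).abs()
    return err
```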