A Geometric Perspective towards Neural Calibration via Sensitivity Decomposition

Authors: Junjiao Tian, Dylan Yung, Yen-Chang Hsu, Zsolt Kira

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments
Researcher Affiliation | Collaboration | Junjiao Tian (Georgia Institute of Technology, jtian73@gatech.edu); Dylan Yung (Georgia Institute of Technology, dyung6@gatech.edu); Yen-Chang Hsu (Samsung Research America, yenchang.hsu@samsung.com); Zsolt Kira (Georgia Institute of Technology, zkira@gatech.edu)
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code available at https://github.com/GT-RIPL/Geometric-Sensitivity-Decomposition.git.
Open Datasets | Yes | Following prior works [9, 8, 5], we will use CIFAR10 and CIFAR100 as the in-distribution training and testing datasets, and apply the image corruption library provided by [1] to benchmark calibration performance under distribution shift (a data-loading sketch follows the table).
Dataset Splits | Yes | The first step is calibrating the model on the IND validation set (note: the method does not rely on OOD validation data), similar to temperature calibration [4]. However, instead of tuning a temperature parameter as shown in Fig. 1a, the authors simply tune the offset parameter β on the validation set in one of two ways: 1) grid search minimizing Expected Calibration Error (see Sec. 4), or 2) SGD optimization minimizing Negative Log Likelihood [4] (a tuning sketch follows the table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments are provided in the paper.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The new model can be trained with the same procedure as the vanilla network, with no additional hyperparameter tuning, architecture changes, or extended training time. The authors regularize α so that the instance-independent component Cφ stays small: they penalize ‖α − 1‖₂², because α = cos Cφ, i.e., if α → 1 then Cφ → 0. They found empirically that a larger relaxation angle Cφ deteriorates performance, because the angular similarity already correlates well with the difficulty of the data [11], so a large relaxation need not be encouraged; Sec. 4.3 verifies this empirically. The offset parameter β is tuned on the validation set in one of two ways: 1) grid search minimizing Expected Calibration Error (see Sec. 4), or 2) SGD optimization minimizing Negative Log Likelihood [4]. Because these are post-training procedures, both methods are very efficient; the tuned parameter is denoted β. For β Optimized, the paper states: "optimize β on the validation set via gradient descent to minimize NLL for 10 epochs". It also states that c is a hyperparameter which can be calculated as in Eq. 10: the non-linear function grows exponentially close to the calibrated affine mapping of Eq. 8, at a rate dictated by 1 − e^(−c‖x‖₂), as shown in Fig. 1c, so e^(−c‖x‖₂) can be viewed as an error term quantifying how close the non-linear function is to the calibrated affine function of Eq. 8. Let µx and σx denote the mean and standard deviation of the distribution of the norm of IND sample embeddings, computed on the validation set. The heuristic is that, evaluated at one standard deviation below the mean, ‖x‖₂ = µx − σx, the approximation error is e^(−c(µx − σx)) = 0.1. Although the error threshold is a hyperparameter, a threshold of 0.1 led to state-of-the-art results across all models: c = −ln(1 − error)/(µx − σx) = −ln(0.9)/(µx − σx). (Sketches of the α penalty and of the c computation follow the table.)
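
For the Open Datasets row above, a minimal data-setup sketch: CIFAR10 as the in-distribution set via torchvision, plus the CIFAR-10-C corruption files of [1]. The file paths are placeholders, and the layout (five severities of 10,000 images stacked per corruption file) follows the public CIFAR-10-C release; this is an assumed reconstruction, not the authors' code.

```python
# Data-setup sketch: IND CIFAR10 plus the CIFAR-10-C corrupted test set.
import numpy as np
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()

# In-distribution train/test sets (torchvision downloads CIFAR10 if absent).
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

# CIFAR-10-C: each corruption file stacks 5 severities x 10,000 test images.
images = np.load("./CIFAR-10-C/gaussian_noise.npy")  # (50000, 32, 32, 3) uint8
labels = np.load("./CIFAR-10-C/labels.npy")          # (50000,)
severity = 3                                         # 1..5
lo, hi = (severity - 1) * 10000, severity * 10000
corrupted_images, corrupted_labels = images[lo:hi], labels[lo:hi]
```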
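
For the Dataset Splits row, a sketch of the two post-hoc ways of tuning the offset β on the IND validation set. The ECE computation is the standard equal-width-bin estimator; `recalibrated_probs` and `logit_fn` are assumed wrappers around the paper's β-dependent logit mapping (Eq. 8) and stand in for the authors' actual model code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error over equal-width confidence bins."""
    conf = probs.max(axis=1)
    acc = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return err

# 1) Grid search: pick the beta whose recalibrated validation probabilities
#    minimize ECE. `recalibrated_probs(b)` -> (N, K) softmax outputs (numpy).
def tune_beta_grid(recalibrated_probs, labels, betas):
    return min(betas, key=lambda b: ece(recalibrated_probs(b), labels))

# 2) SGD: "optimize beta on the validation set via gradient descent to
#    minimize NLL for 10 epochs". `logit_fn(b)` -> (N, K) logits (torch).
def tune_beta_sgd(logit_fn, labels, epochs=10, lr=0.01):
    beta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([beta], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(logit_fn(beta), labels)  # NLL of the softmax
        loss.backward()
        opt.step()
    return beta.detach()
```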
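
The α regularizer quoted in the Experiment Setup row reduces to a squared penalty pulling α toward 1 (and hence Cφ toward 0, since α = cos Cφ). A minimal sketch, assuming α is a learnable tensor of the model and `lam` is a hypothetical penalty weight not specified in the quote:

```python
import torch

def alpha_penalty(alpha: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # ||alpha - 1||_2^2 keeps the instance-independent relaxation C_phi small.
    return lam * torch.sum((alpha - 1.0) ** 2)

# e.g. total training loss: loss = task_loss + alpha_penalty(model.alpha)
```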
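
Finally, the Eq. 10 heuristic for c, implemented directly from the formula quoted above (error threshold 0.1, so c = −ln(0.9)/(µx − σx)); `val_embeddings` is an assumed name for the penultimate-layer features of the IND validation set:

```python
import numpy as np

def compute_c(val_embeddings, error=0.1):
    # Mean/std of the embedding-norm distribution on the IND validation set.
    norms = np.linalg.norm(val_embeddings, axis=1)
    mu, sigma = norms.mean(), norms.std()
    # Quoted heuristic: c = -ln(1 - error)/(mu - sigma) = -ln(0.9)/(mu - sigma)
    return -np.log(1.0 - error) / (mu - sigma)
```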