Calibrating Transformers via Sparse Gaussian Processes

Authors: Wenlong Chen, Yingzhen Li

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.
Researcher Affiliation | Academia | Wenlong Chen & Yingzhen Li, Imperial College London, {wenlong.chen21, yingzhen.li}@imperial.ac.uk
Pseudocode | No | The paper provides mathematical derivations and flow diagrams (e.g., Figure 1), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Example code can be found at: https://github.com/chenw20/SGPA.
Open Datasets | Yes | Datasets: CIFAR10 & CIFAR100 (image classification (Krizhevsky et al., 2009), CV tasks); CoLA (linguistic acceptability prediction (Warstadt et al., 2019), NLP task); and IMDB (sentiment analysis (Maas et al., 2011), NLP task).
Dataset Splits | Yes | Sentiment analysis with IMDB (Maas et al., 2011): we consider 5 different splits, each of which includes 35,000 training, 5,000 validation, and 10,000 test instances. (See the split sketch after this table.)
Hardware Specification | Yes | Results were obtained using a single Nvidia RTX 2080 Ti GPU card.
Software Dependencies | No | "Initialized via the default method of the deep learning platform (we use Pytorch (Paszke et al., 2019))." No version for PyTorch is given.
Experiment Setup | Yes | For each layer, we use a mean-pooling strategy. The non-linear mapping Gϕl at each attention layer is parameterised by a 2-layer MLP as in Vaswani et al. (2017). For models with kernel-based attentions, we use the exponential kernel for sentiment analysis and linguistic acceptability, ... and we use the ARD-RBF kernel ... For MFVI, MCD, SNGP and SGPA, predictive uncertainty is estimated using 10 Monte Carlo samples. ... All the models are trained using the ADAM optimiser (Kingma & Ba, 2015), and for each input sequence in a batch, we draw only one sample to estimate the ELBO (eq. 14). ... For SGPA we use 50 global inducing points for each head. We train all the models (except the post-hoc methods, TS and KFLLLA) for 20 epochs with batch size 32 and an initial learning rate of 0.001 that decays linearly to 0.0001. (See the kernel-attention and training-configuration sketches after this table.)
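
The IMDB split protocol quoted above (5 random splits of 35,000 training / 5,000 validation / 10,000 test instances) is not tied to any released script in the quoted text. The snippet below is a minimal sketch of one plausible way to generate such splits, assuming the full 50,000-example IMDB corpus is already loaded as parallel lists of texts and labels; the function name and seeds are illustrative, not from the paper.

```python
import numpy as np

def make_imdb_splits(texts, labels, n_splits=5,
                     n_train=35_000, n_val=5_000, n_test=10_000):
    """Sketch: build 5 random train/val/test splits of the 50k IMDB examples.

    `texts` and `labels` are assumed to be parallel lists covering the full
    corpus; the split sizes follow the quoted protocol.
    """
    assert len(texts) == n_train + n_val + n_test
    splits = []
    for seed in range(n_splits):              # one split per seed (seeds are illustrative)
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(texts))
        train_idx = idx[:n_train]
        val_idx = idx[n_train:n_train + n_val]
        test_idx = idx[n_train + n_val:]
        splits.append((train_idx, val_idx, test_idx))
    return splits
```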
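
The quoted setup mentions an exponential kernel and an ARD-RBF kernel for the kernel-based attention variants. As a rough illustration only (not the authors' implementation; the sqrt(d) scaling and tensor shapes are assumptions), the two similarities between query and key vectors could be evaluated as follows.

```python
import torch

def exponential_kernel(q, k):
    """exp(q.k / sqrt(d)): softmax-style similarity without normalisation.

    q: (..., n_q, d), k: (..., n_k, d). The sqrt(d) scaling is an assumption.
    """
    d = q.shape[-1]
    return torch.exp(q @ k.transpose(-2, -1) / d ** 0.5)

def ard_rbf_kernel(q, k, lengthscales):
    """ARD-RBF kernel: exp(-0.5 * sum_j ((q_j - k_j) / l_j)^2), one lengthscale per dimension."""
    q = q / lengthscales                                          # (..., n_q, d)
    k = k / lengthscales                                          # (..., n_k, d)
    sq_dist = (q.unsqueeze(-2) - k.unsqueeze(-3)).pow(2).sum(-1)  # (..., n_q, n_k)
    return torch.exp(-0.5 * sq_dist)
```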
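
The optimiser and schedule quoted above (ADAM, 20 epochs, batch size 32, learning rate decaying linearly from 0.001 to 0.0001) can be wired up with standard PyTorch components. The sketch below shows one such wiring; `model`, the data loader, and the single-sample ELBO loss (`model.negative_elbo`) are hypothetical placeholders, since the paper's own training code is not reproduced here.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

NUM_EPOCHS = 20
LR_START, LR_END = 1e-3, 1e-4

def run_training(model, train_loader):
    optimizer = Adam(model.parameters(), lr=LR_START)
    # Linear decay of the learning rate from 1e-3 to 1e-4 over the 20 epochs.
    scheduler = LambdaLR(
        optimizer,
        lr_lambda=lambda epoch: 1.0 - (1.0 - LR_END / LR_START) * epoch / (NUM_EPOCHS - 1),
    )
    for epoch in range(NUM_EPOCHS):
        for batch in train_loader:            # batch size 32, per the quoted setup
            optimizer.zero_grad()
            # Placeholder: the paper estimates the ELBO (their eq. 14) with a single
            # Monte Carlo sample per input sequence; `negative_elbo` is hypothetical.
            loss = model.negative_elbo(batch, num_samples=1)
            loss.backward()
            optimizer.step()
        scheduler.step()
```

At test time, predictive uncertainty would then be estimated by averaging over 10 Monte Carlo forward passes, as stated in the quoted setup.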