Calibrating Transformers via Sparse Gaussian Processes
Authors: Wenlong Chen, Yingzhen Li
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection. |
| Researcher Affiliation | Academia | Wenlong Chen & Yingzhen Li, Imperial College London, {wenlong.chen21, yingzhen.li}@imperial.ac.uk |
| Pseudocode | No | The paper provides mathematical derivations and flow diagrams (e.g., Figure 1), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Example code can be found at: https://github.com/chenw20/SGPA. |
| Open Datasets | Yes | Datasets: CIFAR10 & CIFAR100 (image classification (Krizhevsky et al., 2009), CV tasks); CoLA (linguistic acceptability prediction (Warstadt et al., 2019), NLP task) and IMDB (sentiment analysis (Maas et al., 2011), NLP task). |
| Dataset Splits | Yes | Sentiment analysis with IMDB (Maas et al., 2011). We consider 5 different splits, each including 35,000 training, 5,000 validation, and 10,000 test instances. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | results obtained using a single Nvidia GTX 2080 Ti GPU card |
| Software Dependencies | No | Initialized via the default method of the deep learning platform (we use Pytorch (Paszke et al., 2019)). (No PyTorch version is given.) |
| Experiment Setup | Yes | For each layer, we use a mean-pooling strategy. The non-linear mapping Gϕl at each attention layer is parameterised by a 2-layer MLP as in Vaswani et al. (2017). For models with kernel-based attentions, we use an exponential kernel for sentiment analysis and linguistic acceptability, ... and we use an ARD-RBF kernel ... For MFVI, MCD, SNGP and SGPA, predictive uncertainty is estimated using 10 Monte Carlo samples. ... all the models are trained using the ADAM optimiser (Kingma & Ba, 2015), and for each input sequence in a batch, we only draw one sample to estimate the ELBO (eq. 14). ... For SGPA we use 50 global inducing points for each head. We train all the models (except the post-hoc methods, TS and KFLLLA) for 20 epochs with batch size 32 and with an initial learning rate of 0.001 which decays linearly to 0.0001. (A hedged training-loop sketch follows the table.) |
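To make the reported IMDB split sizes concrete, the sketch below builds 5 seeded random splits of the 50,000 labelled reviews into 35,000 training, 5,000 validation, and 10,000 test indices. This is an assumption about the procedure (the seeds, ordering, and splitting code are not specified here), not the authors' exact implementation.

```python
import numpy as np

# 50,000 IMDB reviews split into 35,000 train / 5,000 val / 10,000 test.
n_total, n_train, n_val = 50_000, 35_000, 5_000

splits = []
for seed in range(5):  # 5 different splits, one per seed (seeds are assumed)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    splits.append({
        "train": perm[:n_train],
        "val": perm[n_train:n_train + n_val],
        "test": perm[n_train + n_val:],  # remaining 10,000 indices
    })
```

The optimisation settings quoted in the Experiment Setup row (Adam, batch size 32, 20 epochs, learning rate decaying linearly from 0.001 to 0.0001, one ELBO sample per sequence during training, 10 Monte Carlo samples for predictive uncertainty) correspond roughly to the PyTorch loop below. The tiny classifier and cross-entropy loss are stand-ins only; the actual SGPA Transformer and its ELBO (eq. 14) live in the authors' repository at https://github.com/chenw20/SGPA.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data (NOT the paper's SGPA architecture or datasets).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)  # batch size 32

optimizer = Adam(model.parameters(), lr=1e-3)  # initial learning rate 0.001
num_epochs = 20
# Learning-rate factor decays linearly from 1.0 (lr = 1e-3) to 0.1 (lr = 1e-4).
scheduler = LambdaLR(
    optimizer, lr_lambda=lambda e: max(0.1, 1.0 - 0.9 * e / (num_epochs - 1))
)
criterion = nn.CrossEntropyLoss()  # placeholder for the negative ELBO (eq. 14)

for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        # In SGPA training, a single Monte Carlo sample per input sequence
        # would be drawn here to estimate the ELBO.
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()

# At test time the paper estimates predictive uncertainty with 10 Monte Carlo
# samples; for a stochastic model this amounts to averaging softmax outputs
# over 10 forward passes.
```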
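The linear learning-rate decay is implemented here with `LambdaLR` rather than a specific scheduler named by the paper; only the endpoints (0.001 to 0.0001 over 20 epochs) are taken from the quoted setup, and the exact schedule granularity (per epoch vs. per step) is an assumption.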