Transformer Uncertainty Estimation with Hierarchical Stochastic Attention

Authors: Jiahuan Pei, Cheng Wang, György Szarvas

AAAI 2022 (pp. 11147-11155) | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets. The experimental results demonstrate that our approach: (1) achieves the best predictive performance and uncertainty trade-off among compared methods; (2) exhibits very competitive (in most cases, improved) predictive performance on ID datasets; (3) is on par with Monte Carlo dropout and ensemble methods in uncertainty estimation on OOD datasets.
Researcher Affiliation | Collaboration | 1 University of Amsterdam; 2 Amazon Development Center Germany GmbH, Berlin, Germany
Pseudocode | Yes | Algorithm 1: Hierarchical stochastic transformer. (An illustrative sketch of such a stochastic attention layer appears after this table.)
Open Source Code | No | The paper does not include any explicit statement about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We use IMDB dataset (Maas et al. 2011) for the sentiment analysis task... Besides, we use customer review (CR) dataset (Hendrycks and Gimpel 2017) which has 500 samples to evaluate the proposed model in OOD settings. We conduct the second experiment on linguistic acceptability task with CoLA dataset (Warstadt, Singh, and Bowman 2019).
Dataset Splits | Yes | For hyperparameter selection, we take 10% of training data as validation set, leading to 22,500/2,500/25,000 data samples for training, validation, and testing. ... It consists of 8,551 training and 527 validation in-domain samples. (A sketch reproducing the IMDB split appears after this table.)
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run the experiments. It mentions 'We implement models in PyTorch' but does not specify GPU/CPU models, memory, or cloud resources.
Software Dependencies | No | The paper mentions: 'We implement models in PyTorch (Paszke et al. 2019). The models are trained with Adam (Kingma and Ba 2014) as the optimization algorithm.' While PyTorch and Adam are mentioned, specific version numbers for PyTorch or other libraries are not provided.
Experiment Setup | Yes | For sentiment analysis, we use 1 layer with 8 heads, both the embedding size and the hidden dimension size are 128. We train the model with learning rate of 1e-3, batch size of 128, and dropout rate of 0.5/0.1. We evaluate models at each epoch, and the models are trained with maximum 50 epochs. ... For linguistic acceptability, we use 8 layers and 8 heads, the embedding size is 128 and the hidden dimension is 512. We train the model with learning rate of 5e-5, batch size of 32 and dropout rate of 0.1. We train the models with maximum 2000 epochs and evaluate the models at every 50 epochs. (A configuration sketch using these values follows the table.)
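
On the Pseudocode row: Algorithm 1 describes a hierarchical stochastic transformer in which attention weights are sampled rather than computed deterministically. The sketch below only illustrates that general idea using Gumbel-softmax sampling over a small set of learnable centroids; the class name, centroid count, temperature, and single-head formulation are assumptions, not the authors' Algorithm 1 or released code.

# Illustrative sketch only: a single-head stochastic attention layer that
# (a) softly assigns keys to learnable centroids sampled with Gumbel-softmax,
# and (b) samples the query-key attention weights with Gumbel-softmax instead
# of a deterministic softmax. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_centroids: int = 16, tau: float = 1.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.centroids = nn.Parameter(torch.randn(n_centroids, d_model))
        self.tau = tau
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Level 1: stochastically assign each key to the centroids.
        key_logits = k @ self.centroids.t() * self.scale           # (B, T, C)
        key_assign = F.gumbel_softmax(key_logits, tau=self.tau, dim=-1)
        k_stoch = key_assign @ self.centroids                      # (B, T, D)
        # Level 2: sample attention weights over the stochastic keys.
        attn_logits = q @ k_stoch.transpose(-2, -1) * self.scale   # (B, T, T)
        attn = F.gumbel_softmax(attn_logits, tau=self.tau, dim=-1)
        return attn @ v

# Repeated stochastic forward passes can then be aggregated to estimate
# predictive uncertainty, similar in spirit to MC dropout.
x = torch.randn(2, 5, 128)
layer = StochasticAttentionSketch(d_model=128)
samples = torch.stack([layer(x) for _ in range(10)])  # (10, 2, 5, 128)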
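
On the Dataset Splits row: the quoted 22,500/2,500/25,000 IMDB split corresponds to holding out 10% of the training set for validation. The snippet below reproduces that arithmetic; the use of the Hugging Face datasets library and the fixed random seed are assumptions, since the paper does not state how the data were loaded or shuffled.

# Sketch only: 90/10 split of the IMDB training set into train/validation.
from datasets import load_dataset

imdb = load_dataset("imdb")                              # 25,000 train / 25,000 test
split = imdb["train"].train_test_split(test_size=0.1, seed=42)
train_set, valid_set = split["train"], split["test"]     # 22,500 / 2,500
test_set = imdb["test"]                                  # 25,000
# CoLA (8,551 training samples; in-domain vs. OOD validation partition) follows
# the original CoLA release; the paper does not describe a custom re-split.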
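
On the Experiment Setup row: the quoted hyperparameters can be wired into a plain PyTorch/Adam skeleton as below. The config layout and the use of nn.TransformerEncoder as a stand-in model are assumptions; the authors' stochastic model and data pipeline are not released.

# Sketch only: sentiment-analysis (IMDB) hyperparameters from the paper,
# attached to a placeholder transformer encoder and an Adam optimizer.
import torch
from torch import nn, optim

config = dict(
    n_layers=1, n_heads=8, emb_dim=128, hidden_dim=128,  # IMDB setup
    lr=1e-3, batch_size=128, dropout=0.5, max_epochs=50,
)
# For CoLA the paper instead uses 8 layers / 8 heads, emb_dim=128,
# hidden_dim=512, lr=5e-5, batch_size=32, dropout=0.1, up to 2000 epochs.

model = nn.TransformerEncoder(                           # stand-in for the stochastic model
    nn.TransformerEncoderLayer(
        d_model=config["emb_dim"], nhead=config["n_heads"],
        dim_feedforward=config["hidden_dim"], dropout=config["dropout"],
        batch_first=True,
    ),
    num_layers=config["n_layers"],
)
optimizer = optim.Adam(model.parameters(), lr=config["lr"])
criterion = nn.CrossEntropyLoss()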