Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Authors: Lorenz Kuhn, Yarin Gal, Sebastian Farquhar

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
Researcher Affiliation | Academia | Lorenz Kuhn, Yarin Gal, Sebastian Farquhar, OATML Group, Department of Computer Science, University of Oxford
Pseudocode | Yes | Algorithm pseudocode is provided in Appendix A.2: In Algorithm 1 we provide the pseudocode for our bi-directional entailment algorithm. (See the clustering sketch after this table.)
Open Source Code | Yes | All code and data used in our experiments are available at https://github.com/lorenzkuhn/semantic_uncertainty. We make all of our code, as well as the hand-labelled semantic equivalence dataset drawn from TriviaQA and CoQA, available under an MIT license.
Open Datasets | Yes | Datasets. We use CoQA (Reddy et al., 2019) as an open-book conversational question answering problem... We also use TriviaQA (Joshi et al., 2017) as a closed-book QA problem...
Dataset Splits | No | The paper evaluates pre-trained models on existing splits of publicly available datasets (the development split, or a subset of the training split). It does not define its own training/validation splits because no models are trained from scratch.
Hardware Specification | Yes | We run all of our experiments on 80GB NVIDIA A100s.
Software Dependencies | No | We use both the OPT models and the DeBERTa-large model via the Hugging Face transformers library. This names the software components but does not give a specific version of the transformers library.
Experiment Setup | Yes | We sample these sequences only from a single model using either multinomial sampling or multinomial beam sampling... the choice of sampling temperature... The optimal temperature is 0.5... we use beam search using the generate() function with num_beams=5 and do_sample=True... by default we use multinomial sampling, that is, generate() using do_sample=True and num_beams=1. We use 10 sampled answers per question... For TriviaQA, we use a 10-shot prompt... (See the generation sketch after this table.)
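The generation settings quoted in the Experiment Setup row map fairly directly onto Hugging Face generate() calls. Below is a minimal sketch, assuming the facebook/opt-2.7b checkpoint (the paper evaluates several OPT sizes), a placeholder prompt, and an illustrative max_new_tokens; the authors' exact prompts and lengths may differ.

```python
# Sketch of the sampling setup described above; checkpoint, prompt, and
# max_new_tokens are illustrative assumptions, not the paper's exact values.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-2.7b"  # assumed size for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "Q: In which city is the Eiffel Tower located?\nA:"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# Default setting quoted above: multinomial sampling (num_beams=1, do_sample=True)
# at temperature 0.5, with 10 sampled answers per question.
sampled = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,
    temperature=0.5,
    num_return_sequences=10,
    max_new_tokens=32,
)
answers = [
    tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
    for seq in sampled
]

# Beam-sampling variant quoted above: num_beams=5 with do_sample=True.
beam_sampled = model.generate(
    **inputs, do_sample=True, num_beams=5, temperature=0.5, max_new_tokens=32
)
most_likely = tokenizer.decode(
    beam_sampled[0][prompt_len:], skip_special_tokens=True
).strip()
```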
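The bi-directional entailment algorithm referenced in the Pseudocode row groups sampled answers into semantic-equivalence classes by checking that two answers (each concatenated with the question) entail one another under an NLI model. The following is a minimal sketch, assuming the microsoft/deberta-large-mnli checkpoint and simple question-plus-answer concatenation; the authors' exact checkpoint and prompt formatting are in their repository.

```python
# Sketch of bi-directional entailment clustering; the NLI checkpoint and the
# question/answer concatenation are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "microsoft/deberta-large-mnli"  # assumed NLI checkpoint
nli_tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)
# Index of the "entailment" class, read from the model config (fallback: 2).
ENTAIL_ID = nli_model.config.label2id.get("ENTAILMENT", 2)

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts entailment for premise -> hypothesis."""
    enc = nli_tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**enc).logits
    return logits.argmax(dim=-1).item() == ENTAIL_ID

def cluster_by_bidirectional_entailment(question: str, answers: list[str]) -> list[list[str]]:
    """Group sampled answers into semantic-equivalence classes."""
    clusters: list[list[str]] = []
    for answer in answers:
        placed = False
        for cluster in clusters:
            rep = f"{question} {cluster[0]}"
            cand = f"{question} {answer}"
            # Mutual entailment is taken to mean "same meaning".
            if entails(rep, cand) and entails(cand, rep):
                cluster.append(answer)
                placed = True
                break
        if not placed:
            clusters.append([answer])
    return clusters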
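Given the clusters above and a log-likelihood for each sampled sequence under the generating model, the semantic entropy mentioned in the Research Type row can be estimated by pooling probability mass within each cluster and averaging the negative cluster log-probabilities. This is a sketch of that Monte Carlo estimate, assuming per-sequence log-likelihoods have already been computed and grouped by cluster.

```python
# Sketch of a Monte Carlo estimate of semantic entropy from clustered
# per-sequence log-likelihoods; the inputs here are assumed, not taken
# from the paper's experiments.
import math

def semantic_entropy(clustered_log_likelihoods: list[list[float]]) -> float:
    """Average negative log-probability of the semantic clusters, where a
    cluster's probability is the sum of its members' probabilities
    (accumulated with a log-sum-exp for numerical stability)."""
    cluster_log_probs = []
    for log_liks in clustered_log_likelihoods:
        m = max(log_liks)
        cluster_log_probs.append(m + math.log(sum(math.exp(l - m) for l in log_liks)))
    return -sum(cluster_log_probs) / len(cluster_log_probs)

# Example: three semantic clusters with hypothetical sequence log-likelihoods.
print(semantic_entropy([[-1.2, -1.5, -1.4], [-2.8], [-3.1, -2.9]]))
```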