Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Authors: Lorenz Kuhn, Yarin Gal, Sebastian Farquhar
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines. |
| Researcher Affiliation | Academia | Lorenz Kuhn, Yarin Gal, Sebastian Farquhar OATML Group, Department of Computer Science, University of Oxford |
| Pseudocode | Yes | Algorithm pseudocode is provided in Appendix A.2: "In Algorithm 1 we provide the pseudocode for our bi-directional entailment algorithm." |
| Open Source Code | Yes | All code and data used in our experiments are available at https://github.com/lorenzkuhn/semantic_uncertainty. We make all of our code, as well as the hand-labelled semantic equivalence dataset drawn from TriviaQA and CoQA, available under an MIT license. |
| Open Datasets | Yes | Datasets. We use CoQA (Reddy et al., 2019) as an open-book conversational question answering problem... We also use TriviaQA (Joshi et al., 2017) as a closed-book QA problem... |
| Dataset Splits | No | The paper uses pre-trained models and evaluates them on specific splits (development split, subset of training split) of publicly available datasets. It does not describe its own training/validation splits as it does not train the models from scratch. |
| Hardware Specification | Yes | We run all of our experiments on 80GB NVIDIA A100s. |
| Software Dependencies | No | We use both the OPT models and the DeBERTa-large model via the Hugging Face transformers library. This names the software components but does not pin a specific version of the `transformers` library. |
| Experiment Setup | Yes | We sample these sequences only from a single model using either multinomial sampling or multinomial beam sampling... the choice of sampling temperature... The optimal temperature is 0.5... we use beam search using the generate() function with num_beams=5 and do_sample=True... by default we use multinomial sampling, that is generate() using do_sample=True and num_beams=1. We use 10 sampled answers per question... For TriviaQA, we use a 10-shot prompt... |
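To make the pseudocode claim concrete, the core of the paper's method can be sketched in a few lines: sampled answers are grouped into semantic clusters via bi-directional entailment, and entropy is taken over cluster probabilities rather than individual sequences. This is an illustrative sketch, not the authors' released code; `entails` stands in for the DeBERTa-large NLI model the paper uses, and here a toy string-equality oracle is substituted so the example is self-contained.

```python
from math import log

def cluster_by_bidirectional_entailment(answers, entails):
    """Group answers into semantic-equivalence clusters: two answers share
    a cluster iff each entails the other (bi-directional entailment)."""
    clusters = []  # each cluster is a list of indices into `answers`
    for i, ans in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]  # compare against one representative
            if entails(rep, ans) and entails(ans, rep):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def semantic_entropy(answers, probs, entails):
    """Sum (normalized) sequence probabilities within each semantic cluster,
    then take the entropy over the resulting cluster probabilities."""
    clusters = cluster_by_bidirectional_entailment(answers, entails)
    total = sum(probs)
    cluster_p = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * log(p) for p in cluster_p if p > 0)

# Toy entailment oracle standing in for the NLI model: treats
# case-insensitive string equality as mutual entailment.
def toy_entails(a, b):
    return a.lower() == b.lower()
```

With `["Paris", "paris", "Lyon"]` the first two answers collapse into one cluster, so the entropy reflects two semantic meanings rather than three surface strings.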
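The experiment-setup row describes two decoding configurations for Hugging Face's `generate()`. Expressed as keyword-argument dicts (values from the quoted setup; `num_return_sequences=10` reflects the 10 sampled answers per question and is my reading of the excerpt, not a verbatim quote):

```python
# Default: multinomial sampling, one sequence at a time from the model's
# softmax distribution at temperature 0.5, 10 samples per question.
multinomial_sampling = dict(
    do_sample=True,
    num_beams=1,
    temperature=0.5,
    num_return_sequences=10,
)

# Alternative: multinomial beam sampling with 5 beams.
beam_sampling = dict(
    do_sample=True,
    num_beams=5,
    temperature=0.5,
)
```

Either dict would be unpacked into a call such as `model.generate(input_ids, **multinomial_sampling)` on one of the OPT checkpoints.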