Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Authors: Lorenz Kuhn, Yarin Gal, Sebastian Farquhar
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines. |
| Researcher Affiliation | Academia | Lorenz Kuhn, Yarin Gal, Sebastian Farquhar OATML Group, Department of Computer Science, University of Oxford |
| Pseudocode | Yes | Algorithm pseudocode is provided in Appendix A.2: "In Algorithm 1 we provide the pseudocode for our bi-directional entailment algorithm." |
| Open Source Code | Yes | All code and data used in our experiments are available at https://github.com/lorenzkuhn/semantic_uncertainty. We make all of our code, as well as the hand-labelled semantic equivalence dataset drawn from TriviaQA and CoQA, available under an MIT license. |
| Open Datasets | Yes | Datasets. We use CoQA (Reddy et al., 2019) as an open-book conversational question answering problem... We also use TriviaQA (Joshi et al., 2017) as a closed-book QA problem... |
| Dataset Splits | No | The paper uses pre-trained models and evaluates them on specific splits (development split, subset of training split) of publicly available datasets. It does not describe its own training/validation splits as it does not train the models from scratch. |
| Hardware Specification | Yes | We run all of our experiments on 80GB NVIDIA A100s. |
| Software Dependencies | No | We use both the OPT models and the DeBERTa-large model via the Hugging Face transformers library. This names the software components but does not pin a specific version of the `transformers` library. |
| Experiment Setup | Yes | We sample these sequences only from a single model using either multinomial sampling or multinomial beam sampling... the choice of sampling temperature... The optimal temperature is 0.5... we use beam search using the generate() function with num_beams=5 and do_sample=True... by default we use multinomial sampling, that is generate() using do_sample=True and num_beams=1. We use 10 sampled answers per question... For TriviaQA, we use a 10-shot prompt... |
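To make the pseudocode claim concrete, the core of the paper's method can be sketched in a few lines: sampled answers are grouped into semantic clusters via bi-directional entailment, and entropy is taken over cluster probabilities rather than individual sequences. This is an illustrative sketch, not the authors' released code; `entails` stands in for the DeBERTa-large NLI model the paper uses, and here a toy string-equality oracle is substituted so the example is self-contained.

```python
from math import log

def cluster_by_bidirectional_entailment(answers, entails):
    """Group answers into semantic-equivalence clusters: two answers share
    a cluster iff each entails the other (bi-directional entailment)."""
    clusters = []  # each cluster is a list of indices into `answers`
    for i, ans in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]  # compare against one representative
            if entails(rep, ans) and entails(ans, rep):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def semantic_entropy(answers, probs, entails):
    """Sum (normalized) sequence probabilities within each semantic cluster,
    then take the entropy over the resulting cluster probabilities."""
    clusters = cluster_by_bidirectional_entailment(answers, entails)
    total = sum(probs)
    cluster_p = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * log(p) for p in cluster_p if p > 0)

# Toy entailment oracle standing in for the NLI model: treats
# case-insensitive string equality as mutual entailment.
def toy_entails(a, b):
    return a.lower() == b.lower()
```

With `["Paris", "paris", "Lyon"]` the first two answers collapse into one cluster, so the entropy reflects two semantic meanings rather than three surface strings.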
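The experiment-setup row describes two decoding configurations for Hugging Face's `generate()`. Expressed as keyword-argument dicts (values from the quoted setup; `num_return_sequences=10` reflects the 10 sampled answers per question and is my reading of the excerpt, not a verbatim quote):

```python
# Default: multinomial sampling, one sequence at a time from the model's
# softmax distribution at temperature 0.5, 10 samples per question.
multinomial_sampling = dict(
    do_sample=True,
    num_beams=1,
    temperature=0.5,
    num_return_sequences=10,
)

# Alternative: multinomial beam sampling with 5 beams.
beam_sampling = dict(
    do_sample=True,
    num_beams=5,
    temperature=0.5,
)
```

Either dict would be unpacked into a call such as `model.generate(input_ids, **multinomial_sampling)` on one of the OPT checkpoints.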