Factuality Enhanced Language Models for Open-Ended Text Generation

Authors: Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, Bryan Catanzaro

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FACTUALITYPROMPTS test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions. In addition, popular sampling algorithms (e.g., top-p) in open-ended text generation can harm the factuality due to the uniform randomness introduced at every sampling step. We propose the factual-nucleus sampling algorithm that dynamically adapts the randomness to improve the factuality of generation while maintaining quality. Furthermore, we analyze the inefficiencies of the standard training method in learning correct associations between entities from factual text corpus (e.g., Wikipedia). We propose a factuality-enhanced training method that uses TOPICPREFIX for better awareness of facts and sentence completion as the training objective, which can vastly reduce the factual errors.
Researcher Affiliation | Collaboration | Nayeon Lee¹, Wei Ping², Peng Xu², Mostofa Patwary², Pascale Fung¹, Mohammad Shoeybi², and Bryan Catanzaro²; ¹Hong Kong University of Science and Technology, ²NVIDIA. Work done during an internship at NVIDIA.
Pseudocode | No | The paper describes algorithms (e.g., factual-nucleus sampling) in prose and with mathematical formulas, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any method.
Open Source Code | Yes | The implementation can be found in https://github.com/nayeon7lee/FactualityPrompt
Open Datasets | Yes | Note that Wikipedia is one of the most commonly-used, accessible, large-scale, good quality, unstructured knowledge sources. Our proposed methods can easily generalize to other knowledge sources in plain text (e.g., arXiv papers, medical reports, reliable newspapers). ... For document-level ground-truth knowledge, we directly use the Wikipedia document annotation from the FEVER dataset.
Dataset Splits | No | The paper discusses concepts of training, validation, and test sets. However, it does not provide specific details on the split percentages or counts for these datasets (e.g., '80/10/10 split' or 'X training samples, Y validation samples, Z test samples'), which are necessary for exact reproducibility of data partitioning.
Hardware Specification | No | The paper makes general references to 'large-scale LMs' and models with 'parameter sizes ranging from 126M to 530B' and mentions a 'GPU memory limit', but it does not specify any particular hardware details such as exact GPU models (e.g., NVIDIA A100), CPU types, or detailed computing infrastructure used for running the experiments.
Software Dependencies | No | The paper mentions several software components, such as 'Spacy.io' for named entity detection, 'Sentence Transformer [62]', and a 'RoBERTa [69] model fine-tuned on MNLI [70] dataset' leveraged via 'pytorch.org/hub/pytorch_fairseq_roberta/'. However, it does not provide specific version numbers for any of these software dependencies, which are crucial for reproducible environments. A hedged loading sketch for these components is given after the table.
Experiment Setup | Yes | In factual-nucleus sampling, the nucleus probability p_t used to generate the t-th token within each sentence is p_t = max{ω, p · λ^(t−1)}, where λ is the decay factor for the top-p probability and ω lower-bounds the decay of the probability. ... We report our decoding experimental results with the 1.3B LM in Table 4. Adding λ-decay improves the factuality of top-p 0.9, for instance with decay rate λ = 0.5. Table 4 gives the specific values of p, λ, and ω for each decoding setting (e.g., p | λ settings '0.9 | 0.9' and '0.9 | 0.5', and p | λ | ω settings '0.9 | 0.9 | 0.7' and '0.9 | 0.9 | 0.3'). A minimal sketch of this decay schedule follows the table.
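
As a reading aid for the decoding setup quoted in the Experiment Setup row, the following is a minimal sketch of the factual-nucleus decay schedule p_t = max{ω, p · λ^(t−1)}. The function name and default values are illustrative choices taken from the settings listed in that row, not the authors' reference implementation (which lives in the linked repository).

```python
def factual_nucleus_p(t: int, p: float = 0.9, lam: float = 0.9, omega: float = 0.3) -> float:
    """Nucleus (top-p) threshold for the t-th token of the current sentence.

    Implements p_t = max(omega, p * lam ** (t - 1)): the threshold starts
    at p for the first token of each sentence, decays by a factor lam per
    token, and never drops below omega. In the paper the decay is reset
    at every sentence boundary.
    """
    return max(omega, p * lam ** (t - 1))


# Example: with p = 0.9, lam = 0.9, omega = 0.3 the schedule decays
# 0.9, 0.81, 0.729, ... and bottoms out at 0.3 for later tokens.
schedule = [factual_nucleus_p(t) for t in range(1, 11)]
print([round(v, 3) for v in schedule])
```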
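
For the components named in the Software Dependencies row, here is a hedged loading sketch. Because the paper reports no versions, the spaCy pipeline (en_core_web_sm) and Sentence Transformer checkpoint (all-MiniLM-L6-v2) are assumptions made for illustration; the RoBERTa-MNLI call follows the pytorch.org/hub/pytorch_fairseq_roberta/ page cited in the row.

```python
# Hedged sketch of the evaluation dependencies; the model names below are
# assumptions, since the paper does not pin versions or checkpoints.
import spacy
import torch
from sentence_transformers import SentenceTransformer

ner = spacy.load("en_core_web_sm")                   # assumed spaCy NER pipeline
retriever = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence-embedding checkpoint

# RoBERTa fine-tuned on MNLI, loaded as documented on PyTorch Hub.
nli = torch.hub.load("pytorch/fairseq", "roberta.large.mnli")
nli.eval()
```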