Factuality Enhanced Language Models for Open-Ended Text Generation

Authors: Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, Bryan Catanzaro

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FACTUALITYPROMPTS test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions. In addition, popular sampling algorithms (e.g., top-p) in open-ended text generation can harm the factuality due to the uniform randomness introduced at every sampling step. We propose the factual-nucleus sampling algorithm that dynamically adapts the randomness to improve the factuality of generation while maintaining quality. Furthermore, we analyze the inefficiencies of the standard training method in learning correct associations between entities from factual text corpus (e.g., Wikipedia). We propose a factuality-enhanced training method that uses TOPICPREFIX for better awareness of facts and sentence completion as the training objective, which can vastly reduce the factual errors.
Researcher Affiliation | Collaboration | Nayeon Lee¹, Wei Ping², Peng Xu², Mostofa Patwary², Pascale Fung¹, Mohammad Shoeybi², and Bryan Catanzaro²; ¹Hong Kong University of Science and Technology, ²NVIDIA. Work done during an internship at NVIDIA.
Pseudocode | No | The paper describes algorithms (e.g., factual-nucleus sampling) in prose and with mathematical formulas, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for any method.
Open Source Code | Yes | The implementation can be found in https://github.com/nayeon7lee/FactualityPrompt
Open Datasets | Yes | Note that Wikipedia is one of the most commonly-used, accessible, large-scale, good quality, unstructured knowledge sources. Our proposed methods can easily generalize to other knowledge sources in plain text (e.g., arXiv papers, medical reports, reliable newspapers). ... For document-level ground-truth knowledge, we directly use the Wikipedia document annotation from the FEVER dataset.
Dataset Splits | No | The paper discusses concepts of training, validation, and test sets. However, it does not provide specific details on the split percentages or counts for these datasets (e.g., '80/10/10 split' or 'X training samples, Y validation samples, Z test samples'), which are necessary for exact reproducibility of data partitioning.
Hardware Specification | No | The paper makes general references to 'large-scale LMs' and models with 'parameter sizes ranging from 126M to 530B' and mentions a 'GPU memory limit', but it does not specify any particular hardware details such as exact GPU models (e.g., NVIDIA A100), CPU types, or detailed computing infrastructure used for running the experiments.
Software Dependencies | No | The paper mentions several software components, such as 'Spacy.io' for named entity detection, 'Sentence Transformer [62]', and a 'RoBERTa [69] model fine-tuned on MNLI [70] dataset' leveraged via 'pytorch.org/hub/pytorch_fairseq_roberta/'. However, it does not provide specific version numbers for any of these software dependencies, which are crucial for reproducible environments. A hedged loading sketch for these components is given after the table.
Experiment Setup | Yes | In factual-nucleus sampling, the nucleus probability p_t used to generate the t-th token within each sentence is p_t = max{ω, p · λ^(t−1)}, where λ is the decay factor for the top-p probability and ω lower-bounds the decay of the probability. ... We report our decoding experimental results with the 1.3B LM in Table 4. Adding λ-decay improves the factuality of top-p 0.9, for instance with decay rate λ = 0.5. Table 4 gives the specific values of p, λ, and ω for each decoding setting (e.g., p | λ settings '0.9 | 0.9' and '0.9 | 0.5', and p | λ | ω settings '0.9 | 0.9 | 0.7' and '0.9 | 0.9 | 0.3'). A minimal sketch of this decay schedule follows the table.
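
As a reading aid for the decoding setup quoted in the Experiment Setup row, the following is a minimal sketch of the factual-nucleus decay schedule p_t = max{ω, p · λ^(t−1)}. The function name and default values are illustrative choices taken from the settings listed in that row, not the authors' reference implementation (which lives in the linked repository).

```python
def factual_nucleus_p(t: int, p: float = 0.9, lam: float = 0.9, omega: float = 0.3) -> float:
    """Nucleus (top-p) threshold for the t-th token of the current sentence.

    Implements p_t = max(omega, p * lam ** (t - 1)): the threshold starts
    at p for the first token of each sentence, decays by a factor lam per
    token, and never drops below omega. In the paper the decay is reset
    at every sentence boundary.
    """
    return max(omega, p * lam ** (t - 1))


# Example: with p = 0.9, lam = 0.9, omega = 0.3 the schedule decays
# 0.9, 0.81, 0.729, ... and bottoms out at 0.3 for later tokens.
schedule = [factual_nucleus_p(t) for t in range(1, 11)]
print([round(v, 3) for v in schedule])
```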
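
For the components named in the Software Dependencies row, here is a hedged loading sketch. Because the paper reports no versions, the spaCy pipeline (en_core_web_sm) and Sentence Transformer checkpoint (all-MiniLM-L6-v2) are assumptions made for illustration; the RoBERTa-MNLI call follows the pytorch.org/hub/pytorch_fairseq_roberta/ page cited in the row.

```python
# Hedged sketch of the evaluation dependencies; the model names below are
# assumptions, since the paper does not pin versions or checkpoints.
import spacy
import torch
from sentence_transformers import SentenceTransformer

ner = spacy.load("en_core_web_sm")                   # assumed spaCy NER pipeline
retriever = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence-embedding checkpoint

# RoBERTa fine-tuned on MNLI, loaded as documented on PyTorch Hub.
nli = torch.hub.load("pytorch/fairseq", "roberta.large.mnli")
nli.eval()
```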