Long-form factuality in large language models

Authors: Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V Le

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators: on a set of 16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality.
Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 Stanford University, 3 University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes the steps of the SAFE method in text and diagrams (e.g., Figure 1), but does not present them in structured pseudocode or a clearly labeled algorithm block. (An illustrative sketch of those steps is given after the table.)
Open Source Code | Yes | LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.
Open Datasets | Yes | We make LongFact publicly available at https://github.com/google-deepmind/long-form-factuality/tree/main/longfact.
Dataset Splits | No | The paper states 'We evaluated each model on the same random subset of 250 prompts from LongFact-Objects' and 'We did not train any models, we evaluated existing models in Section 6'. It describes a fixed subset used for evaluation, but no separate train/validation/test splits, since the paper does not train any models.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions using GPT-3.5-Turbo and the Serper API, which are external services, and states that 'Most experiments rely on querying OpenAI models (or other proprietary models) that are not publicly-available, making it difficult to calculate the compute used for these experiments.'
Software Dependencies | Yes | We use gpt-3.5-turbo-0125 for all of our experiments with SAFE.
Experiment Setup | Yes | We use gpt-4-0613 at a temperature of 1.0 and a max decode length of 128 tokens. We use gpt-3.5-turbo-0125 for all of our experiments with SAFE. We use a temperature of 0 for this step; all other steps use a temperature of 0.1. We allow the model to issue five (5) search queries per fact and return three (3) search results per query. We decode model responses up to 1,024 tokens at a temperature of zero. We selected K = 64 (the median number of relevant facts among all model responses for the tested prompts) and K = 178 (the maximum number of relevant facts in a response among all model responses for the tested prompts). (See the F1@K sketch after the table.)
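
As noted in the Pseudocode row, SAFE's steps are described only in prose and Figure 1. The following is a rough orientation sketch of that loop (split a response into individual facts, rewrite each to be self-contained, check relevance to the prompt, then rate each relevant fact against Google Search results), not the authors' implementation; the real code is in the linked repository. The prompt strings, the output parsing, and the serper_search helper are placeholders, while the model name, temperatures, and query/result limits are the values quoted in the table. The table does not say which step uses temperature 0, so applying it to the final rating step is an assumption.

```python
# Illustrative sketch of the SAFE loop described in the paper's text and Figure 1.
# NOT the authors' implementation (see the linked repository); the prompt strings,
# output parsing, and serper_search helper below are placeholders.
from openai import OpenAI

client = OpenAI()

SAFE_MODEL = "gpt-3.5-turbo-0125"  # per the Software Dependencies row
QUERIES_PER_FACT = 5               # per the Experiment Setup row
RESULTS_PER_QUERY = 3              # per the Experiment Setup row


def llm(prompt: str, temperature: float = 0.1) -> str:
    """One call to the rater model (temperature 0.1 for most steps, per the quoted setup)."""
    resp = client.chat.completions.create(
        model=SAFE_MODEL,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def serper_search(query: str, num_results: int) -> list[str]:
    """Placeholder for a Google Search call through the Serper API."""
    raise NotImplementedError


def safe_rate_response(question: str, response: str) -> dict:
    """Split -> make self-contained -> check relevance -> search -> rate."""
    facts = llm(f"List every individual fact in the response, one per line:\n{response}")
    counts = {"supported": 0, "not_supported": 0, "irrelevant": 0}
    for fact in filter(None, (line.strip() for line in facts.splitlines())):
        fact = llm(f"Rewrite the fact so it is self-contained:\n{fact}")
        relevant = llm(
            f"Answer yes or no: is the fact relevant to the question?\n"
            f"Question: {question}\nFact: {fact}"
        )
        if "yes" not in relevant.lower():
            counts["irrelevant"] += 1
            continue
        evidence: list[str] = []
        for _ in range(QUERIES_PER_FACT):
            query = llm(f"Write one Google Search query to verify the fact:\n{fact}")
            evidence.extend(serper_search(query, RESULTS_PER_QUERY))
        # The table quotes temperature 0 for one SAFE step; it is assumed here
        # to be this final supported / not-supported rating.
        verdict = llm(
            f"Based on these search results, answer 'supported' or 'not supported'.\n"
            f"Fact: {fact}\nResults: {evidence}",
            temperature=0.0,
        ).lower()
        counts["not_supported" if "not supported" in verdict else "supported"] += 1
    return counts
```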
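
The two K values in the Experiment Setup row are cutoffs for the paper's F1@K aggregate metric, which combines precision over the rated facts with recall capped at K supported facts. Below is a minimal sketch of that aggregation, assuming the definition given in the paper (the score is zero when no fact is supported); the example counts are hypothetical.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K: precision over rated facts, recall capped at K supported facts.
    Defined as 0.0 when no fact in the response is supported."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Hypothetical SAFE counts for one response, scored at the two quoted cutoffs.
print(f1_at_k(supported=90, not_supported=10, k=64))   # K = 64 (median)
print(f1_at_k(supported=90, not_supported=10, k=178))  # K = 178 (maximum)
```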