How Language Model Hallucinations Can Snowball

Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "To study this behavior empirically, we automatically construct three question-answering (QA) datasets." (a hypothetical construction sketch for one such dataset follows the table) |
| Researcher Affiliation | Academia | "1 Paul G. Allen School of Computer Science and Engineering, University of Washington; 2 Princeton University; 3 Princeton Language and Intelligence; 4 Center for Data Science, New York University; 5 Allen Institute for AI." |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination" |
| Open Datasets | Yes | "Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination" |
| Dataset Splits | No | The paper evaluates pre-trained LMs on custom-built datasets and does not mention explicit training, validation, or test splits for its own experimental setup. |
| Hardware Specification | No | The paper mentions using GPT-3.5 and GPT-4 via the OpenAI API, and LLaMA2-70B-chat, but does not specify hardware details (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the OpenAI API and specific language models (e.g., 'gpt-3.5-turbo'), but does not list version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | "We run all experiments on GPT-3.5 (gpt-3.5-turbo), GPT-4, and LLaMA2-70B-chat with greedy decoding. [...] At t = 0.6 and t = 0.9, both error rates and snowballed hallucination rates remain similarly high, in all the models we tested (Figure 5). [...] We tested beam search (with the number of beams set to 10) on LLaMA-2-70B-chat only" (see the decoding sketch after the table) |
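The Research Type and Open Datasets rows refer to three automatically constructed QA datasets (the paper describes primality testing, senator search, and graph connectivity tasks), released in the repository linked above. The sketch below is a hypothetical illustration of how a primality-testing set could be generated programmatically; the function name, dataset size, number range, and JSON format are assumptions, not the authors' released construction procedure.

```python
# Hypothetical sketch of auto-constructing a primality-testing QA set in the
# spirit of the paper's datasets; the released data at
# https://github.com/Nanami18/Snowballed_Hallucination may use a different
# procedure, size, and format.
import json
import random
from sympy import isprime  # standard primality test used only to label answers

random.seed(0)

def build_primality_questions(n_questions: int = 500, low: int = 1_000, high: int = 20_000):
    """Sample odd integers, label each by primality, and phrase a yes/no question."""
    dataset = []
    while len(dataset) < n_questions:
        candidate = random.randrange(low + 1, high, 2)  # odd candidates only
        dataset.append({
            "question": f"Is {candidate} a prime number?",
            "answer": "Yes" if isprime(candidate) else "No",
        })
    return dataset

if __name__ == "__main__":
    print(json.dumps(build_primality_questions(3), indent=2))
```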
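The Experiment Setup row lists the decoding configurations: greedy decoding for the main results, sampling at t = 0.6 and t = 0.9 for the temperature ablation, and beam search (10 beams) on LLaMA-2-70B-chat. The snippet below is a minimal sketch of how the API-served models could be queried under those settings, assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the helper name, prompt, and default model string are illustrative and not taken from the authors' code.

```python
# Minimal sketch (not the authors' released code) of the reported decoding settings:
# greedy decoding plus sampling at t = 0.6 and t = 0.9.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, model: str = "gpt-3.5-turbo", temperature: float = 0.0) -> str:
    """Send a single question; temperature=0.0 approximates greedy decoding."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "Is 10733 a prime number? Answer yes or no, then explain."  # illustrative prompt
    for t in (0.0, 0.6, 0.9):  # greedy, then the two sampling temperatures from the paper
        print(f"t={t}: {ask(question, temperature=t)}")
```

For the LLaMA-2-70B-chat beam-search condition, an open-weights setup would typically pass num_beams=10 to Hugging Face's generate method, though the paper does not specify the serving stack it used.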