How Language Model Hallucinations Can Snowball

Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "To study this behavior empirically, we automatically construct three question-answering (QA) datasets." (a hypothetical construction sketch for one such dataset follows the table) |
| Researcher Affiliation | Academia | "1 Paul G. Allen School of Computer Science and Engineering, University of Washington; 2 Princeton University; 3 Princeton Language and Intelligence; 4 Center for Data Science, New York University; 5 Allen Institute for AI." |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination" |
| Open Datasets | Yes | "Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination" |
| Dataset Splits | No | The paper evaluates pre-trained LMs on custom-built datasets and does not mention explicit training, validation, or test splits for its own experimental setup. |
| Hardware Specification | No | The paper mentions using GPT-3.5 and GPT-4 via the OpenAI API, and LLaMA2-70B-chat, but does not specify hardware details (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the OpenAI API and specific language models (e.g., 'gpt-3.5-turbo'), but does not list version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | "We run all experiments on GPT-3.5 (gpt-3.5-turbo), GPT-4, and LLaMA2-70B-chat with greedy decoding. [...] At t = 0.6 and t = 0.9, both error rates and snowballed hallucination rates remain similarly high, in all the models we tested (Figure 5). [...] We tested beam search (with the number of beams set to 10) on LLaMA-2-70B-chat only" (see the decoding sketch after the table) |
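The Research Type and Open Datasets rows refer to three automatically constructed QA datasets (the paper describes primality testing, senator search, and graph connectivity tasks), released in the repository linked above. The sketch below is a hypothetical illustration of how a primality-testing set could be generated programmatically; the function name, dataset size, number range, and JSON format are assumptions, not the authors' released construction procedure.

```python
# Hypothetical sketch of auto-constructing a primality-testing QA set in the
# spirit of the paper's datasets; the released data at
# https://github.com/Nanami18/Snowballed_Hallucination may use a different
# procedure, size, and format.
import json
import random
from sympy import isprime  # standard primality test used only to label answers

random.seed(0)

def build_primality_questions(n_questions: int = 500, low: int = 1_000, high: int = 20_000):
    """Sample odd integers, label each by primality, and phrase a yes/no question."""
    dataset = []
    while len(dataset) < n_questions:
        candidate = random.randrange(low + 1, high, 2)  # odd candidates only
        dataset.append({
            "question": f"Is {candidate} a prime number?",
            "answer": "Yes" if isprime(candidate) else "No",
        })
    return dataset

if __name__ == "__main__":
    print(json.dumps(build_primality_questions(3), indent=2))
```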
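The Experiment Setup row lists the decoding configurations: greedy decoding for the main results, sampling at t = 0.6 and t = 0.9 for the temperature ablation, and beam search (10 beams) on LLaMA-2-70B-chat. The snippet below is a minimal sketch of how the API-served models could be queried under those settings, assuming the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the helper name, prompt, and default model string are illustrative and not taken from the authors' code.

```python
# Minimal sketch (not the authors' released code) of the reported decoding settings:
# greedy decoding plus sampling at t = 0.6 and t = 0.9.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, model: str = "gpt-3.5-turbo", temperature: float = 0.0) -> str:
    """Send a single question; temperature=0.0 approximates greedy decoding."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "Is 10733 a prime number? Answer yes or no, then explain."  # illustrative prompt
    for t in (0.0, 0.6, 0.9):  # greedy, then the two sampling temperatures from the paper
        print(f"t={t}: {ask(question, temperature=t)}")
```

For the LLaMA-2-70B-chat beam-search condition, an open-weights setup would typically pass num_beams=10 to Hugging Face's generate method, though the paper does not specify the serving stack it used.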