How Language Model Hallucinations Can Snowball
Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study this behavior empirically, we automatically construct three question-answering (QA) datasets. |
| Researcher Affiliation | Academia | 1 Paul G. Allen School of Computer Science and Engineering, University of Washington; 2 Princeton University; 3 Princeton Language and Intelligence; 4 Center for Data Science, New York University; 5 Allen Institute for AI. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination |
| Open Datasets | Yes | Data and code can be found at https://github.com/Nanami18/Snowballed_Hallucination |
| Dataset Splits | No | The paper evaluates pre-trained LMs on custom-built datasets and does not mention explicit training, validation, or test dataset splits for their own experimental setup. |
| Hardware Specification | No | The paper mentions using GPT-3.5 and GPT-4 via the OpenAI API, and LLaMA2-70B-chat, but does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the OpenAI API and specific language models (e.g., 'gpt-3.5-turbo'), but does not list specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We run all experiments on GPT-3.5 (gpt-3.5-turbo), GPT-4, and LLaMA2-70B-chat with greedy decoding. [...] At t = 0.6 and t = 0.9, both error rates and snowballed hallucination rates remain similarly high, in all the models we tested (Figure 5). [...] We tested beam search (with the number of beams set to 10) on LLaMA-2-70B-chat only |
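
To make the quoted decoding setup concrete, here is a minimal sketch (not the authors' released code) of querying gpt-3.5-turbo at temperature 0, which approximates greedy decoding through the API, and at the sampling temperatures t = 0.6 and t = 0.9 reported in the paper. It assumes the openai-python v1 client and an `OPENAI_API_KEY` in the environment; the example question is hypothetical, written in the style of the paper's QA datasets.

```python
# Sketch (not the authors' code): greedy vs. temperature-sampled decoding
# via the OpenAI API, matching the settings quoted in the setup above.
# Requires the openai-python v1 client and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical question in the style of the paper's QA datasets.
question = "Is 10733 a prime number?"

for temperature in (0.0, 0.6, 0.9):  # 0.0 approximates greedy decoding
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    print(f"t={temperature}: {response.choices[0].message.content}")
```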
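The beam-search condition (10 beams, LLaMA-2-70B-chat only) could be reproduced along the lines below. This is a sketch under assumptions, not the authors' pipeline: the Hugging Face model id `meta-llama/Llama-2-70b-chat-hf` and the `transformers` generate API are one plausible way to realize the quoted setting, and loading a 70B model requires substantial GPU memory plus the `accelerate` package for `device_map="auto"`.

```python
# Sketch (not the authors' code): beam search with 10 beams on
# LLaMA-2-70B-chat via Hugging Face transformers, matching the
# setting reported in the Experiment Setup row.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated repo; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical question in the style of the paper's QA datasets.
inputs = tokenizer("Is 10733 a prime number?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    num_beams=10,      # beam count reported in the paper
    do_sample=False,   # deterministic search, no sampling
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```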