LLM-Check: Investigating Detection of Hallucinations in Large Language Models
Authors: Gaurang Sriramanan, Siddhant Bharti, Vinu Sankar Sadasivan, Shoumik Saha, Priyatham Kattakinda, Soheil Feizi
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a comprehensive investigation into the nature of hallucinations within LLMs and furthermore explore effective techniques for detecting such inaccuracies in various real-world settings. We demonstrate that the proposed detection methods are extremely compute-efficient, with speedups of up to 45x and 450x over other baselines, while achieving significant improvements in detection performance over diverse datasets. |
| Researcher Affiliation | Academia | Gaurang Sriramanan (gaurangs@cs.umd.edu), Siddhant Bharti (sbharti@cs.umd.edu), Vinu Sankar Sadasivan (vinu@cs.umd.edu), Shoumik Saha (smksaha@cs.umd.edu), Priyatham Kattakinda (pkattaki@umd.edu), Soheil Feizi (sfeizi@cs.umd.edu); Department of Computer Science, University of Maryland, College Park, USA |
| Pseudocode | No | No pseudocode or algorithm block found. |
| Open Source Code | Yes | The codebase for LLM-Check is available at this URL. |
| Open Datasets | Yes | For detection without external references using a single model response, we utilize the FAVA-Annotation dataset [Mishra et al., 2024]. Furthermore, we use the fine-grained classification of different forms of hallucination annotated in the FAVA dataset to analyze the efficacy of the detection measures across these types. For detection without external references using multiple model responses, we utilize the SelfCheckGPT dataset [Manakul et al., 2023]. For evaluations in the setting where external references are assumed available, we primarily consider the RAGTruth dataset [Wu et al., 2023]. |
| Dataset Splits | Yes | In each setting, we consider balanced datasets with an equal number of samples with and without hallucinations, except for the SelfCheckGPT dataset, where we follow the setup used in the original paper (a minimal balancing sketch appears after the table). |
| Hardware Specification | Yes | We compare the overall runtime cost of the proposed detection scores with other baselines using a Llama-2-7b Chat model on the FAVA-Annotation dataset on a single NVIDIA A5000 GPU in Figure 3. We run our experiments mainly on NVIDIA A5000 and A6000 GPUs. |
| Software Dependencies | Yes | We utilize popular open-source LLM chat models such as Llama-2-7b Chat [Touvron et al., 2023], Vicuna [Zheng et al., 2023], and Llama-3-Instruct [AI@Meta, 2024] as our autoregressive LLMs, with their corresponding tokenizers provided by Hugging Face [Wolf et al., 2020]. For all our evaluations, we use PyTorch models [Paszke et al., 2019] (see the loading and generation sketch after the table). |
| Experiment Setup | Yes | By default, we set the generation configuration of the Hugging Face model to {"temperature": 0.6, "top_p": 0.9, "top_k": 50, "do_sample": True}. For the Logit Entropy score, we consider the top 50 tokens as the selected candidates (see the entropy sketch after the table). For the Hidden scores and Attention scores, we report the best results obtained over all layers, varying between 1 and 32 for Llama-2-7b and Vicuna-7b. |
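
The Dataset Splits row reports balanced evaluation sets with an equal number of hallucinated and non-hallucinated samples. Below is a minimal balancing sketch; the `label` field name, the sampling scheme, and the seed are assumptions for illustration, not the authors' code.

```python
# Sketch: build a balanced evaluation set with equal numbers of hallucinated
# and non-hallucinated samples (assumed data format: dicts with a 0/1 "label").
import random

def balance(samples, label_key="label", seed=0):
    pos = [s for s in samples if s[label_key] == 1]  # hallucinated samples
    neg = [s for s in samples if s[label_key] == 0]  # non-hallucinated samples
    n = min(len(pos), len(neg))                      # size of the smaller class
    rng = random.Random(seed)
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```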
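
The Software Dependencies and Experiment Setup rows describe open-source chat models loaded via Hugging Face with the generation configuration {"temperature": 0.6, "top_p": 0.9, "top_k": 50, "do_sample": True}. The sketch below shows one way to reproduce that setup with the `transformers` library; the checkpoint ID, dtype, prompt, and `max_new_tokens` are assumptions, and the paper's own pipeline may differ.

```python
# Sketch: load a Llama-2-7b Chat checkpoint and sample with the reported
# generation configuration (temperature 0.6, top_p 0.9, top_k 50, sampling on).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed Hugging Face checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Who wrote 'Pride and Prejudice'?"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    top_k=50,
    max_new_tokens=128,  # assumed length budget
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```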
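
The Experiment Setup row also states that the Logit Entropy score is computed over the top 50 candidate tokens. The following is a hedged sketch of a per-token entropy computed from the model's logits, renormalized over the top-k candidates; the exact score definition used in LLM-Check may differ, and the mean-over-sequence reduction here is an assumption.

```python
# Sketch: per-token entropy over the top-k candidate logits, averaged over the
# sequence (assumed aggregation). Higher entropy suggests less confident tokens.
import torch
import torch.nn.functional as F

@torch.no_grad()
def top_k_logit_entropy(model, tokenizer, text, k=50):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0]                # (seq_len, vocab_size)
    top_logits, _ = torch.topk(logits, k, dim=-1)     # keep top-k candidates per position
    probs = F.softmax(top_logits, dim=-1)             # renormalize over the k candidates
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-token entropy
    return entropy.mean().item()                      # assumed: average over the sequence
```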