INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
Authors: Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal. |
| Researcher Affiliation | Collaboration | Chao Chen¹, Kai Liu², Ze Chen¹, Yi Gu¹, Yue Wu¹, Mingyuan Tao¹, Zhihang Fu¹, Jieping Ye¹ (¹Alibaba Cloud, ²Zhejiang University) |
| Pseudocode | No | The paper describes the proposed method in text and provides a pipeline illustration (Fig. 1) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper cites and links to several third-party open-source models and tools used in its experiments (e.g., Hugging Face models, Google Research ROUGE, SelfCheckGPT), but it provides no statement or link releasing the code for the proposed INSIDE framework or its EigenScore implementation. |
| Open Datasets | Yes | We utilize four widely used question answering (QA) datasets for evaluation, including two open-book conversational QA datasets CoQA (Reddy et al., 2019) and SQuAD (Rajpurkar et al., 2016), as well as two closed-book QA datasets TriviaQA (Joshi et al., 2017) and Natural Questions (NQ) (Kwiatkowski et al., 2019). |
| Dataset Splits | Yes | We follow Lin et al. (2023) to utilize the development split of CoQA with 7,983 QA pairs, the validation split of NQ with 3,610 QA pairs and the validation split of TriviaQA (rc.nocontext subset) with 9,960 deduplicated QA pairs. For the SQuAD dataset, we filter out the QA pairs with the flag `is_impossible = True`, and utilize the subset of the development v2.0 split with 5,928 QA pairs. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA-A100 and we set the number of generations to N = 10 through the experiments. |
| Software Dependencies | No | The paper states, 'Implementation of this work is based on pytorch and transformers libraries,' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the hyperparameters that are used for sampling strategies of LLMs decoder, we set temperature to 0.5, top-p to 0.99 and top-k to 5 through the experiments. The number of generations is set to K = 10. For the sentence embedding used in our proposal, we use the last token embedding of the sentence in the middle layer, i.e., the layer index is set to int(L/2). For the regularization term of the covariance matrix, we set α = 0.001. For the memory bank used to conserve token embeddings, we set N = 3000. |
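The setup cell above pins down the sampling hyperparameters, the middle-layer last-token embedding, and the covariance regularizer α. A minimal sketch of how those numbers fit together, with hypothetical helper names (the paper gives no pseudocode, so the EigenScore-style log-determinant of the regularized covariance shown here is our reading of the method, not the authors' code):

```python
import numpy as np

# Hyperparameters quoted from the paper's setup.
TEMPERATURE, TOP_P, TOP_K = 0.5, 0.99, 5
K_GENERATIONS = 10   # number of sampled generations per question
ALPHA = 0.001        # regularization term for the covariance matrix

def middle_layer_index(num_layers: int) -> int:
    """Layer whose last-token embedding is used: int(L / 2)."""
    return num_layers // 2

def regularized_covariance(embeddings: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    """Covariance of K sentence embeddings (shape K x d), regularized as Sigma + alpha * I."""
    k, d = embeddings.shape
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / k
    return cov + alpha * np.eye(d)

def eigen_consistency_score(embeddings: np.ndarray) -> float:
    """Mean log-eigenvalue of the regularized covariance (our sketch of an EigenScore-style measure)."""
    sigma = regularized_covariance(embeddings)
    eigvals = np.linalg.eigvalsh(sigma)  # all >= ALPHA thanks to regularization
    return float(np.mean(np.log(eigvals)))

# Toy example: 10 generations with 8-dimensional sentence embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(K_GENERATIONS, 8))
score = eigen_consistency_score(emb)
```

With K = 10 samples and d-dimensional embeddings the raw covariance is rank-deficient whenever d > K; the α·I term keeps every eigenvalue at least α, so the log-determinant stays finite. For a 32-layer model, `middle_layer_index(32)` selects layer 16.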