Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension

Authors: Fan Yin, Jayanth Srinivasa, Kai-Wei Chang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments on four question answering (QA) datasets, we demonstrate the effectiveness of our proposed method.
Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of California, Los Angeles, LA, U.S.A.; 2 Cisco Research, U.S.A.
Pseudocode | No | The paper describes the method verbally and with equations but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at: https://github.com/fanyin3639/LID-HallucinationDetection
Open Datasets | Yes | We consider four generative QA tasks: TriviaQA (Joshi et al., 2017), CoQA (Reddy et al., 2019), HotpotQA (Yang et al., 2018), and TyDiQA-GP (English) (Clark et al., 2020).
Dataset Splits | Yes | For each of the datasets, we generate outputs for 2,000 samples from the validation sets and test the methods with those samples.
Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments, such as GPU or CPU models.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python 3.x, PyTorch 1.x) needed for reproducibility.
Experiment Setup | Yes | For LID-MLE and LID-GeoMLE, we use 500 nearest neighbors when estimating LIDs for all datasets. We follow Kuhn et al. (2022), and set the temperature to be 0.5 and the number of generated samples to be 10. We fine-tune a Llama-2-7B model for 3,000 steps, roughly 3 epochs, on SUPER-NI's training set.
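The LID-MLE setting quoted above (500 nearest neighbors) refers to the Levina-Bickel maximum-likelihood estimator of local intrinsic dimension. Below is a minimal sketch of that estimator, assuming the representations (e.g., LLM hidden states for generated answers) are already available as NumPy arrays; the function name and array layout are illustrative and not taken from the paper's released code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(query, reference, k=500):
    """Levina-Bickel MLE estimate of local intrinsic dimension.

    query:     (d,) vector whose LID we estimate.
    reference: (N, d) pool of vectors forming the neighborhood.
    k:         number of nearest neighbors (the paper reports 500).
    """
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dists, _ = nn.kneighbors(query.reshape(1, -1))
    dists = np.sort(dists[0])
    dists = dists[dists > 0]  # drop exact duplicates of the query
    t_max = dists[-1]         # distance to the k-th neighbor
    # MLE: inverse of the mean log-ratio T_k / T_j over the first k-1 neighbors.
    return 1.0 / np.mean(np.log(t_max / dists[:-1]))
```

The sampling configuration in the same row (temperature 0.5, 10 generated samples) can likewise be sketched with the HuggingFace transformers generate API; the checkpoint identifier and prompt below are placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Q: Who wrote 'Pride and Prejudice'?\nA:"  # illustrative QA prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.5,          # sampling temperature reported in the paper
    num_return_sequences=10,  # number of generations reported in the paper
    max_new_tokens=64,
)
answers = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```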