Truth is Universal: Robust Detection of Lies in LLMs

Authors: Lennart Bürger, Fred A. Hamprecht, Boaz Nadler

NeurIPS 2024

Reproducibility Assessment (Variable — Result — LLM Response):

Research Type — Experimental
  "Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios."

Researcher Affiliation — Academia
  "1 IWR, Heidelberg University, Germany; 2 Weizmann Institute of Science, Israel"

Pseudocode — No
  The paper describes the procedures for learning truth directions and the TTPD method narratively in Sections 3 and 5, but does not include any explicitly labelled "Pseudocode" or "Algorithm" blocks.

Open Source Code — Yes
  "The code and datasets for replicating the experiments can be found at https://github.com/sciai-lab/Truth_is_Universal."

Open Datasets — Yes
  "To explore the internal truth representation of LLMs, we collected several publicly available, labelled datasets of true and false English statements from previous papers. We then further expanded these datasets to include negated statements, statements with more complex grammatical structures and German statements."

Dataset Splits — No
  The paper mentions training on 80% of the data and evaluating on a held-out 20% or the full test set, and also uses a leave-one-out approach in which excluded datasets serve as test sets. However, it does not describe a separate validation split used for hyperparameter tuning or model selection.

Hardware Specification — Yes
  "Computing the LLaMA3-8B activations for all statements (~45,000) in all datasets took less than two hours using a single Nvidia Quadro RTX 8000 (48 GB) GPU."

Software Dependencies — No
  The paper mentions specific LLMs (e.g., LLaMA3-8B) and a translation tool (the DeepL translator), but does not provide version numbers for any programming languages, libraries, or frameworks used to implement the methods or experiments.

Experiment Setup — Yes
  "The input text is first tokenized into a sequence of h tokens... We feed the LLM one statement at a time and extract the residual stream activation vector a_l ∈ R^d in a fixed layer l over the final token of the input statement... For LLaMA3-8B we choose layer 12... The responses are generated by iteratively sampling the next token using the softmax probabilities derived from the model's logits, corresponding to a temperature setting of T = 1."
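
The activation-extraction step quoted in the experiment setup can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name and the dummy 13-layer, 16-dimensional activations are stand-ins for a real LLaMA3-8B forward pass (which in practice would be obtained with, e.g., Hugging Face transformers and output_hidden_states=True, then indexed at layer 12).

```python
import numpy as np

def extract_final_token_activation(hidden_states, layer):
    """Return the residual-stream vector a_l at the given layer over the
    final token, given per-layer activations as [seq_len, d] arrays."""
    return hidden_states[layer][-1]

# Dummy stand-in for an LLM forward pass: 13 layers, 5 tokens, d = 16.
rng = np.random.default_rng(0)
hidden = [rng.standard_normal((5, 16)) for _ in range(13)]

# For LLaMA3-8B the paper uses layer l = 12; d would be 4096 there.
a_l = extract_final_token_activation(hidden, layer=12)
```

Running this over every statement in the datasets yields one d-dimensional vector per statement, which is the input representation the classifier is trained on.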