Truth is Universal: Robust Detection of Lies in LLMs
Authors: Lennart Bürger, Fred A. Hamprecht, Boaz Nadler
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios. |
| Researcher Affiliation | Academia | ¹ IWR, Heidelberg University, Germany; ² Weizmann Institute of Science, Israel |
| Pseudocode | No | The paper describes the procedures for learning truth directions and the TTPD method narratively in sections 3 and 5, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code and datasets for replicating the experiments can be found at https://github.com/sciai-lab/Truth_is_Universal. |
| Open Datasets | Yes | To explore the internal truth representation of LLMs, we collected several publicly available, labelled datasets of true and false English statements from previous papers. We then further expanded these datasets to include negated statements, statements with more complex grammatical structures and German statements. |
| Dataset Splits | No | The paper mentions training on 80% of the data and evaluating on a 'held-out 20%' or 'full test set', and also uses a 'leave-one-out approach' where excluded datasets are used for testing. However, it does not explicitly describe a separate 'validation' set or split used for hyperparameter tuning or model selection. (See the probe sketch after this table.) |
| Hardware Specification | Yes | Computing the LLaMA3-8B activations for all statements (~45000) in all datasets took less than two hours using a single Nvidia Quadro RTX 8000 (48 GB) GPU. |
| Software Dependencies | No | The paper mentions using specific LLMs (e.g., LLaMA3-8B) and a translation tool (the DeepL translator), but does not provide specific version numbers for any programming languages, libraries, or frameworks used for implementing their methods or experiments. |
| Experiment Setup | Yes | The input text is first tokenized into a sequence of tokens... We feed the LLM one statement at a time and extract the residual stream activation vector a^l ∈ ℝ^d in a fixed layer l over the final token of the input statement... For LLaMA3-8B we choose layer 12... The responses are generated by iteratively sampling the next token using the softmax probabilities derived from the model's logits, corresponding to a temperature setting of T = 1. (See the extraction sketch after this table.) |
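
The Experiment Setup row describes extracting the layer-12 residual stream activation over the final token of each statement. Below is a minimal sketch of that extraction step, assuming the Hugging Face `transformers` API and the `meta-llama/Meta-Llama-3-8B` checkpoint; the model identifier, device handling, and helper function are assumptions for illustration, not the authors' code (their implementation is in the linked repository).

```python
# Minimal sketch: layer-12, final-token residual-stream activation extraction.
# Model name and layer index follow the quoted setup; everything else is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed HF identifier for LLaMA3-8B
LAYER = 12                                  # layer used in the paper for LLaMA3-8B

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def final_token_activation(statement: str) -> torch.Tensor:
    """Return the residual-stream activation a^l in R^d over the final token."""
    inputs = tokenizer(statement, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; index LAYER is the residual
    # stream after transformer block LAYER. Take the last token's vector.
    return outputs.hidden_states[LAYER][0, -1, :].float().cpu()

act = final_token_activation("The city of Paris is in France.")
print(act.shape)  # e.g. torch.Size([4096]) for an 8B model with d = 4096
```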
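
The Dataset Splits row quotes an 80%/20% train/held-out evaluation protocol. The sketch below shows what such a split looks like with a generic linear probe on the extracted activations; it is not the paper's TTPD classifier, and the file names and labels are hypothetical placeholders.

```python
# Minimal sketch: 80/20 split with a generic logistic-regression probe on
# activation vectors. Illustrative only; not the paper's TTPD method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_statements, d) matrix of stacked a^l vectors; y: 1 = true, 0 = false.
X = np.load("activations.npy")  # hypothetical file of extracted activations
y = np.load("labels.npy")       # hypothetical file of statement labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Held-out accuracy: {probe.score(X_test, y_test):.3f}")
```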