Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Truth is Universal: Robust Detection of Lies in LLMs
Authors: Lennart Bürger, Fred A. Hamprecht, Boaz Nadler
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios. |
| Researcher Affiliation | Academia | 1 IWR, Heidelberg University, Germany 2 Weizmann Institute of Science, Israel |
| Pseudocode | No | The paper describes the procedures for learning truth directions and the TTPD method narratively in sections 3 and 5, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code and datasets for replicating the experiments can be found at https://github.com/sciai-lab/Truth_is_Universal. |
| Open Datasets | Yes | To explore the internal truth representation of LLMs, we collected several publicly available, labelled datasets of true and false English statements from previous papers. We then further expanded these datasets to include negated statements, statements with more complex grammatical structures and German statements. |
| Dataset Splits | No | The paper mentions training on 80% of the data and evaluating on a 'held-out 20%' or 'full test set', and also uses a 'leave-one-out approach' where excluded datasets are used for testing. However, it does not explicitly describe a separate 'validation' set or split used for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | Computing the LLaMA3-8B activations for all statements (~45000) in all datasets took less than two hours using a single Nvidia Quadro RTX 8000 (48 GB) GPU. |
| Software Dependencies | No | The paper mentions using specific LLMs (e.g., LLaMA3-8B) and a translation tool (DeepL translator), but does not provide specific version numbers for any programming languages, libraries, or frameworks used for implementing their methods or experiments. |
| Experiment Setup | Yes | The input text is first tokenized into a sequence of h tokens... We feed the LLM one statement at a time and extract the residual stream activation vector a_l ∈ ℝ^d in a fixed layer l over the final token of the input statement... For LLaMA3-8B we choose layer 12... The responses are generated by iteratively sampling the next token using the softmax probabilities derived from the model's logits, corresponding to a temperature setting of T = 1. |
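The setup above, a linear probe trained on fixed-layer residual-stream activations with an 80% train / 20% held-out split, can be sketched as follows. This is a minimal illustration, not the authors' TTPD method: the activations here are synthetic Gaussian clusters separated along a made-up "truth direction" (names `direction`, `acts`, and the dimension `d` are assumptions), standing in for real layer-12 LLaMA3-8B activations.

```python
# Hypothetical sketch of a linear truth probe on residual-stream
# activations, mirroring the 80/20 train/held-out-test protocol.
# Real activations would come from a fixed LLM layer over the final
# token of each statement; synthetic data keeps this self-contained.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # stand-in for residual-stream dim
direction = rng.normal(size=d)           # synthetic "truth direction"
direction /= np.linalg.norm(direction)

n = 1000
labels = rng.integers(0, 2, size=n)      # 1 = true statement, 0 = false
# True/false statements cluster on opposite sides of the direction.
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction) * 2.0

split = int(0.8 * n)                     # 80% train / 20% held-out test
Xtr, ytr = acts[:split], labels[:split]
Xte, yte = acts[split:], labels[split:]

# Logistic regression by plain gradient descent (no extra dependencies).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
    w -= 0.5 * (Xtr.T @ (p - ytr)) / split
    b -= 0.5 * np.mean(p - ytr)

acc = np.mean(((Xte @ w + b) > 0) == yte.astype(bool))
print(f"held-out accuracy: {acc:.2f}")
```

On well-separated synthetic clusters the probe reaches high held-out accuracy; on real activations, performance depends on how linearly the truth representation is encoded at the chosen layer.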