Honesty Is the Best Policy: Defining and Mitigating AI Deception

Authors: Francis Ward, Francesca Toni, Francesco Belardinelli, Tom Everitt

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically ground the theory. First, we show how our graphical criteria can be used to train a non-deceptive reinforcement learning (RL) agent in a toy game from the signalling literature [17]. Then, we demonstrate how to apply our theory to LMs by either prompting or fine-tuning LMs towards goals which incentivize instrumental deception. We show that LMs fine-tuned to be evaluated as truthful are in fact deceptive, and we mitigate this with the path-specific objectives framework. (Section 4: Experiments: exhibiting and mitigating deception in RL agents and LMs)
Researcher Affiliation | Collaboration | Imperial College London; DeepMind; Corresponding author: francis.ward19@imperial.ac.uk
Pseudocode | Yes | We use the PSO algorithm [31] (see Algorithm 1 in supp. material). PSO prevents a deceptive policy from being learned by pruning the game graph to prevent certain edges from being used for optimisation, ensuring the graphical criteria for deception are not met. Algorithm 1 (Path-Specific Objectives). Input: an SCG M = (G, θ), graphical criterion C, policies π^i, natural distributions N. Output: PSO-optimal policy π^i.
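The quoted pseudocode only reproduces the algorithm's signature. As a rough illustration of the pruning idea it describes, here is a minimal sketch; the graph, the flagged edges, and the reward function are invented placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the path-specific objectives (PSO) idea quoted above:
# remove the edges flagged by the graphical deception criterion, then optimise
# the policy using only the pruned graph. All names below are illustrative.

def prune_for_pso(edges, flagged_edges):
    """Return the game graph with the edges meeting the deception criterion removed."""
    flagged = set(flagged_edges)
    return [e for e in edges if e not in flagged]

def optimise_policy(pruned_edges, reward_fn, candidate_policies):
    """Pick the candidate policy with the highest reward evaluated on the pruned graph."""
    return max(candidate_policies, key=lambda policy: reward_fn(policy, pruned_edges))

# Toy signalling-game graph: state S -> message M -> listener belief B -> action A.
edges = [("S", "M"), ("M", "B"), ("B", "A")]
flagged_edges = [("M", "B")]  # edge a deceptive policy would exploit (criterion C)

pruned = prune_for_pso(edges, flagged_edges)

# Placeholder reward on the pruned graph: here the truthful policy scores highest.
reward = lambda policy, graph: 1.0 if policy == "truthful" else 0.0
best = optimise_policy(pruned, reward, ["truthful", "deceptive"])
print(pruned, best)  # [('S', 'M'), ('B', 'A')] truthful
```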
Open Source Code | No | The paper uses the OpenAI API and Hugging Face API for experiments but does not provide a statement about making its own source code publicly available or any specific repository links for the methodology described.
Open Datasets | Yes | We utilize the TruthfulQA dataset, which consists of question-answer pairs along with a human evaluation of truthfulness [56].
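TruthfulQA is publicly released. Assuming the standard Hugging Face Hub listing (dataset id `truthful_qa`, config `generation`, a single `validation` split), it can be loaded as below; the paper itself does not state how it accessed the data.

```python
# Minimal sketch: load TruthfulQA from the Hugging Face Hub.
# The dataset id and config name reflect the public Hub listing, not the paper.
from datasets import load_dataset

truthful_qa = load_dataset("truthful_qa", "generation")
print(truthful_qa["validation"][0]["question"])
```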
Dataset Splits | No | The paper mentions using a 'test set of 135 questions' for evaluation, but it does not explicitly provide details about training, validation, or test dataset splits (e.g., percentages or specific counts for all splits).
Hardware Specification | Yes | For this experiment we use a personal laptop with the following specs. Hardware model: ASUSTeK COMPUTER INC. ZenBook UX425EA_UX425EA; Memory: 8.0 GB; Processor: 11th Gen Intel Core i5-1135G7 @ 2.40GHz × 8; Graphics: Mesa Intel Xe Graphics (TGL GT2); Disk capacity: 512.1 GB
Software Dependencies | No | The paper mentions using the 'OpenAI API' and 'Hugging Face API' for fine-tuning and RL, but it does not specify version numbers for these APIs or any other software dependencies such as specific Python versions, libraries, or frameworks with their versions.
Experiment Setup | Yes | For each fine-tuned LM, we use the same hyperparameters as used for the judge in Lin et al. [56]: number of epochs: 5; batch size: 21; learning rate: 0.1; no data packing.
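These values match the hyperparameters exposed by the legacy OpenAI fine-tunes endpoint (0.1 plausibly being the learning-rate multiplier rather than an absolute rate). A sketch of submitting such a job, assuming openai-python < 1.0, follows; the training-file id and base model are placeholders, and the 'no data packing' setting is not shown.

```python
# Sketch of a fine-tuning job with the hyperparameters quoted above, assuming
# the legacy OpenAI fine-tunes endpoint (openai-python < 1.0). The file id and
# base model are placeholders, not values taken from the paper.
import openai

job = openai.FineTune.create(
    training_file="file-XXXXXXXX",    # placeholder id of an uploaded JSONL file
    model="curie",                    # placeholder base model
    n_epochs=5,                       # "number of epochs: 5"
    batch_size=21,                    # "batch size: 21"
    learning_rate_multiplier=0.1,     # "learning rate: 0.1"
)
print(job["id"])
```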