Honesty Is the Best Policy: Defining and Mitigating AI Deception
Authors: Francis Ward, Francesca Toni, Francesco Belardinelli, Tom Everitt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically ground the theory. First, we show how our graphical criteria can be used to train a non-deceptive reinforcement learning (RL) agent in a toy game from the signalling literature [17]. Then, we demonstrate how to apply our theory to LMs by either prompting or fine-tuning LMs towards goals which incentivize instrumental deception. We show that LMs fine-tuned to be evaluated as truthful are in fact deceptive, and we mitigate this with the path-specific objectives framework. (Section 4, Experiments: exhibiting and mitigating deception in RL agents and LMs) |
| Researcher Affiliation | Collaboration | Imperial College London; DeepMind; Corresponding author: francis.ward19@imperial.ac.uk |
| Pseudocode | Yes | We use the PSO algorithm [31] (see algorithm 1 in supp. material). PSO prevents a deceptive policy from being learned by pruning the game graph to prevent certain edges from being used for optimisation, ensuring the graphical criteria for deception are not met. Algorithm 1 (Path-Specific Objectives). Input: an SCG M = (G, θ), a graphical criterion C, policies π^i, natural distributions N. Output: a PSO-optimal policy π^i. (A hedged sketch of the pruning step appears after this table.) |
| Open Source Code | No | The paper uses the OpenAI API and Hugging Face API for its experiments, but it does not state that its own source code is publicly available, nor does it give a repository link for the described methodology. |
| Open Datasets | Yes | We utilize the TruthfulQA dataset, which consists of question-answer pairs along with a human evaluation of truthfulness [56]. (A minimal loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions a 'test set of 135 questions' used for evaluation, but it does not specify training/validation/test splits (e.g., percentages or counts for each split). |
| Hardware Specification | Yes | For this experiment we use a personal laptop with the following specs. Hardware model: ASUSTeK COMPUTER INC. ZenBook UX425EA_UX425EA; Memory: 8.0 GB; Processor: 11th Gen Intel Core i5-1135G7 @ 2.40GHz × 8; Graphics: Mesa Intel Xe Graphics (TGL GT2); Disk capacity: 512.1 GB |
| Software Dependencies | No | The paper mentions using the 'OpenAI API' and 'Hugging Face API' for fine-tuning and RL, but it does not give version numbers for these APIs or for any other software dependencies, such as specific Python, library, or framework versions. |
| Experiment Setup | Yes | For each fine-tuned LM, we use the same hyperparameters as used for the judge in Lin et al. [56]: Number of epochs: 5; Batch size: 21; Learning rate: 0.1; No data packing. (A hedged fine-tuning call with these settings follows the table.) |
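For the Pseudocode row, the following is a minimal sketch of the pruning idea behind Algorithm 1 (path-specific objectives). The names `pso_optimise`, `criterion`, `optimise`, and `natural_dists` are placeholders introduced here for illustration, not identifiers from the paper's code, and the real algorithm operates on structural causal games rather than bare graphs.

```python
import networkx as nx

def pso_optimise(game_graph: nx.DiGraph, criterion, optimise, natural_dists):
    """Sketch of the path-specific objectives (PSO) idea.

    Edges flagged by the deception criterion are removed before
    optimisation, and the severed influences are held at fixed "natural"
    distributions, so the learned policy cannot exploit those paths.
    """
    pruned = game_graph.copy()
    # Cut every edge that the graphical criterion marks as deception-enabling.
    pruned.remove_edges_from([e for e in game_graph.edges if criterion(e)])
    # Optimise the agent's policy on the pruned graph only; along the cut
    # edges, downstream variables are assumed to follow natural_dists.
    return optimise(pruned, natural_dists)
```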
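The Open Datasets row cites TruthfulQA [56]. A common way to obtain it is through the Hugging Face `datasets` library; the paper does not say which distribution of the dataset the authors used, so the loading path below is an assumption.

```python
from datasets import load_dataset

# TruthfulQA ships a single "validation" split of 817 question-answer pairs.
truthful_qa = load_dataset("truthful_qa", "generation", split="validation")

example = truthful_qa[0]
print(example["question"])
print(example["best_answer"])
```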
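For the Experiment Setup row, the quoted hyperparameters (5 epochs, batch size 21, learning rate 0.1) map naturally onto the OpenAI fine-tuning API, assuming the reported learning rate is the learning-rate multiplier. The training-file ID and base model below are hypothetical, the paper does not state which API version was used, and "no data packing" has no corresponding flag in this call, so it is simply omitted.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder training file and base model; the hyperparameters mirror the
# values quoted in the Experiment Setup row (and Lin et al. [56]).
job = client.fine_tuning.jobs.create(
    training_file="file-XXXXXXXX",        # hypothetical uploaded JSONL file ID
    model="davinci-002",                  # hypothetical base model
    hyperparameters={
        "n_epochs": 5,
        "batch_size": 21,
        "learning_rate_multiplier": 0.1,  # assumed reading of "learning rate 0.1"
    },
)
print(job.id, job.status)
```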