Honesty Is the Best Policy: Defining and Mitigating AI Deception
Authors: Francis Ward, Francesca Toni, Francesco Belardinelli, Tom Everitt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically ground the theory. First, we show how our graphical criteria can be used to train a non-deceptive reinforcement learning (RL) agent in a toy game from the signalling literature [17]. Then, we demonstrate how to apply our theory to LMs by either prompting or fine-tuning LMs towards goals which incentivize instrumental deception. We show that LMs fine-tuned to be evaluated as truthful are in fact deceptive, and we mitigate this with the path-specific objectives framework. (Section 4, Experiments: exhibiting and mitigating deception in RL agents and LMs) |
| Researcher Affiliation | Collaboration | Imperial College London; DeepMind; Corresponding author: francis.ward19@imperial.ac.uk |
| Pseudocode | Yes | We use the PSO algorithm [31] (see algorithm 1 in supp. material). PSO prevents a deceptive policy from being learned by pruning the game graph to prevent certain edges from being used for optimisation, ensuring the graphical criteria for deception are not met. Algorithm 1 (Path-Specific Objectives). Input: an SCG M = (G, θ), a graphical criterion C, policies π^i, natural distributions N. Output: a PSO-optimal policy π^i. (A hedged sketch of the pruning step appears after this table.) |
| Open Source Code | No | The paper uses the OpenAI API and Hugging Face API for its experiments, but it does not state that its own source code is publicly available, nor does it give a repository link for the described methodology. |
| Open Datasets | Yes | We utilize the TruthfulQA dataset, which consists of question-answer pairs along with a human evaluation of truthfulness [56]. (A minimal loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions a 'test set of 135 questions' used for evaluation, but it does not specify training/validation/test splits (e.g., percentages or counts for each split). |
| Hardware Specification | Yes | For this experiment we use a personal laptop with the following specs. Hardware model: ASUSTeK COMPUTER INC. ZenBook UX425EA_UX425EA; Memory: 8.0 GB; Processor: 11th Gen Intel Core i5-1135G7 @ 2.40GHz × 8; Graphics: Mesa Intel Xe Graphics (TGL GT2); Disk capacity: 512.1 GB |
| Software Dependencies | No | The paper mentions using the 'OpenAI API' and 'Hugging Face API' for fine-tuning and RL, but it does not give version numbers for these APIs or for any other software dependencies, such as specific Python, library, or framework versions. |
| Experiment Setup | Yes | For each fine-tuned LM, we use the same hyperparameters as used for the judge in Lin et al. [56]: Number of epochs: 5; Batch size: 21; Learning rate: 0.1; No data packing. (A hedged fine-tuning call with these settings follows the table.) |
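For the Pseudocode row, the following is a minimal sketch of the pruning idea behind Algorithm 1 (path-specific objectives). The names `pso_optimise`, `criterion`, `optimise`, and `natural_dists` are placeholders introduced here for illustration, not identifiers from the paper's code, and the real algorithm operates on structural causal games rather than bare graphs.

```python
import networkx as nx

def pso_optimise(game_graph: nx.DiGraph, criterion, optimise, natural_dists):
    """Sketch of the path-specific objectives (PSO) idea.

    Edges flagged by the deception criterion are removed before
    optimisation, and the severed influences are held at fixed "natural"
    distributions, so the learned policy cannot exploit those paths.
    """
    pruned = game_graph.copy()
    # Cut every edge that the graphical criterion marks as deception-enabling.
    pruned.remove_edges_from([e for e in game_graph.edges if criterion(e)])
    # Optimise the agent's policy on the pruned graph only; along the cut
    # edges, downstream variables are assumed to follow natural_dists.
    return optimise(pruned, natural_dists)
```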
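The Open Datasets row cites TruthfulQA [56]. A common way to obtain it is through the Hugging Face `datasets` library; the paper does not say which distribution of the dataset the authors used, so the loading path below is an assumption.

```python
from datasets import load_dataset

# TruthfulQA ships a single "validation" split of 817 question-answer pairs.
truthful_qa = load_dataset("truthful_qa", "generation", split="validation")

example = truthful_qa[0]
print(example["question"])
print(example["best_answer"])
```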
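For the Experiment Setup row, the quoted hyperparameters (5 epochs, batch size 21, learning rate 0.1) map naturally onto the OpenAI fine-tuning API, assuming the reported learning rate is the learning-rate multiplier. The training-file ID and base model below are hypothetical, the paper does not state which API version was used, and "no data packing" has no corresponding flag in this call, so it is simply omitted.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder training file and base model; the hyperparameters mirror the
# values quoted in the Experiment Setup row (and Lin et al. [56]).
job = client.fine_tuning.jobs.create(
    training_file="file-XXXXXXXX",        # hypothetical uploaded JSONL file ID
    model="davinci-002",                  # hypothetical base model
    hyperparameters={
        "n_epochs": 5,
        "batch_size": 21,
        "learning_rate_multiplier": 0.1,  # assumed reading of "learning rate 0.1"
    },
)
print(job.id, job.status)
```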