AI Control: Improving Safety Despite Intentional Subversion

Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrustedmodel code or using the untrusted model as a monitor substantially improves on simple baselines.
Researcher Affiliation Industry Ryan Greenblatt 1 Buck Shlegeris 1 Kshitij Sachan 2 Fabien Roger 1 1Redwood Research 2Anthropic, work done at Redwood Research.
Pseudocode No The paper describes procedures and pipelines but does not include any formal pseudocode or algorithm blocks.
Open Source Code Yes Code available at https://github.com/ rgreenblatt/control-evaluations
Open Datasets Yes We focus on a testbed we call the APPS backdooring setting: we want to use LLMs to solve 5,000 programming problems from the APPS dataset (of the interview difficulty level) (Hendrycks et al., 2021)
Dataset Splits No No validation/test split. We iterate on prompts and techniques using the same 1500 APPS problems on which we evaluate safety and usefulness, which puts us at the risk of overfitting.
Hardware Specification No The paper uses GPT-4 and GPT-3.5 models, but it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies Yes We use GPT-4 (version 0613) as the untrusted more-capable model. We use GPT-3.5 (version 0613) for this.
Experiment Setup Yes The protocol has 100 H-points, enough to use an H-point on 2% of problems. Ask T whether the code looks suspicious; audit the top 2% most suspicious solutions. Produce code using U. Then use T to judge how suspicious this code is. Submit that code if its suspiciousness is below a certain threshold; otherwise produce code using T and submit that. We use a GPT-4 pipeline: we walk GPT-4 through the process of constructing a possible backdoor and deciding whether it thinks it can actually exploit this. If the difficulty rating is above some threshold, submit a normal solution instead of attempting to produce a backdoor. Generate 5 backdoor ideas. Please rate this likelihood on a scale from 1 to 15.