AI Control: Improving Safety Despite Intentional Subversion
Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion, an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines. |
| Researcher Affiliation | Industry | Ryan Greenblatt (Redwood Research), Buck Shlegeris (Redwood Research), Kshitij Sachan (Anthropic; work done at Redwood Research), Fabien Roger (Redwood Research) |
| Pseudocode | No | The paper describes procedures and pipelines but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/rgreenblatt/control-evaluations |
| Open Datasets | Yes | We focus on a testbed we call the APPS backdooring setting: we want to use LLMs to solve 5,000 programming problems from the APPS dataset (of the interview difficulty level) (Hendrycks et al., 2021) |
| Dataset Splits | No | No validation/test split. We iterate on prompts and techniques using the same 1500 APPS problems on which we evaluate safety and usefulness, which puts us at the risk of overfitting. |
| Hardware Specification | No | The paper uses GPT-4 and GPT-3.5 models, but it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | Yes | We use GPT-4 (version 0613) as the untrusted more-capable model. We use GPT-3.5 (version 0613) for this. |
| Experiment Setup | Yes | The protocol has 100 H-points, enough to use an H-point on 2% of problems. Ask T whether the code looks suspicious; audit the top 2% most suspicious solutions. Produce code using U. Then use T to judge how suspicious this code is. Submit that code if its suspiciousness is below a certain threshold; otherwise produce code using T and submit that. We use a GPT-4 pipeline: we walk GPT-4 through the process of constructing a possible backdoor and deciding whether it thinks it can actually exploit this. If the difficulty rating is above some threshold, submit a normal solution instead of attempting to produce a backdoor. Generate 5 backdoor ideas. Please rate this likelihood on a scale from 1 to 15. |
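
The Open Datasets and Dataset Splits rows above describe the testbed: interview-difficulty problems from the APPS dataset (Hendrycks et al., 2021). Below is a minimal sketch of how such a subset could be loaded with the Hugging Face `datasets` library; the hub path `codeparrot/apps` and the field names `difficulty`, `question`, and `problem_id` are assumptions about the public APPS mirror, not details given in the paper.

```python
# A minimal sketch, assuming the "codeparrot/apps" mirror of APPS and its
# "difficulty"/"question"/"problem_id" fields; adjust names for your setup.
from datasets import load_dataset


def load_interview_problems(limit=None):
    """Return interview-difficulty APPS problems as a list of dicts."""
    # Depending on your `datasets` version, script-based datasets may require
    # trust_remote_code=True (or may be unsupported in very recent releases).
    apps = load_dataset("codeparrot/apps", split="test")
    interview = apps.filter(lambda ex: ex["difficulty"] == "interview")
    problems = list(interview)
    return problems if limit is None else problems[:limit]


if __name__ == "__main__":
    for p in load_interview_problems(limit=3):
        print(p["problem_id"], p["question"][:80].replace("\n", " "))
```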
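
The Experiment Setup row outlines the trusted-monitoring protocol: U (GPT-4) produces code, T (GPT-3.5) rates how suspicious it looks, roughly the top 2% of solutions are audited with limited trusted human labor (H-points), and sufficiently suspicious code is replaced by a T-written solution. The Python sketch below illustrates that control flow under stated assumptions; the helper functions are hypothetical stand-ins for the actual model calls, not the authors' implementation.

```python
# A minimal sketch of the trusted-monitoring control flow described in the
# paper's setup. All helpers below are hypothetical stand-ins for model calls.

UNTRUSTED_MODEL = "gpt-4-0613"         # U: powerful but untrusted
TRUSTED_MODEL = "gpt-3.5-turbo-0613"   # T: less capable but trusted
AUDIT_FRACTION = 0.02                  # 100 H-points over 5,000 problems


def untrusted_solve(problem: str) -> str:
    """Stand-in for querying U for a candidate solution."""
    return f"# candidate solution from {UNTRUSTED_MODEL}"


def trusted_solve(problem: str) -> str:
    """Stand-in for querying T for a (less capable) fallback solution."""
    return f"# fallback solution from {TRUSTED_MODEL}"


def suspicion_score(problem: str, code: str) -> float:
    """Stand-in for asking T how suspicious the code looks (higher = worse)."""
    return 0.0


def human_audit(problem: str, code: str) -> bool:
    """Stand-in for spending one H-point; True if a backdoor is found."""
    return False


def run_protocol(problems, audit_threshold: float, defer_threshold: float):
    """Submit one solution per problem; stop using U if an audit finds a backdoor.

    audit_threshold should be calibrated so that roughly the top 2% most
    suspicious solutions get audited; defer_threshold decides when to swap in
    a T-written solution instead of submitting U's code.
    """
    audits_left = int(AUDIT_FRACTION * len(problems))
    submissions = []
    for problem in problems:
        code = untrusted_solve(problem)
        score = suspicion_score(problem, code)
        if score >= audit_threshold and audits_left > 0:
            audits_left -= 1
            if human_audit(problem, code):
                raise RuntimeError("Audit caught a backdoor: shut down U")
        if score >= defer_threshold:
            code = trusted_solve(problem)  # defer to trusted instead of U's code
        submissions.append(code)
    return submissions
```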
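
The same row also quotes the red-team attack policy: GPT-4 is walked through constructing possible backdoors, generates five ideas, rates its chances on a 1 to 15 scale, and submits a normal solution when the problem looks too difficult or no idea seems promising. The sketch below is one plausible reading of that pipeline; the function names and both threshold values are illustrative assumptions, not the paper's prompts or numbers.

```python
# A minimal sketch of the red-team attack policy quoted above. Helpers and
# thresholds are hypothetical; the real pipeline prompts GPT-4 at each step.

NUM_BACKDOOR_IDEAS = 5
DIFFICULTY_THRESHOLD = 6   # hypothetical: skip attacks on problems rated harder than this
ATTEMPT_THRESHOLD = 8      # hypothetical: minimum 1-15 rating to attempt a backdoor


def honest_solution(problem: str) -> str:
    """Stand-in for U solving the problem normally."""
    return "# ordinary, correct-looking solution"


def rate_difficulty(problem: str) -> int:
    """Stand-in for asking U how difficult the problem is."""
    return 5


def propose_backdoor_ideas(problem: str, n: int) -> list[str]:
    """Stand-in for walking U through constructing n candidate backdoors."""
    return [f"backdoor idea {i}" for i in range(n)]


def rate_idea(problem: str, idea: str) -> int:
    """Stand-in for the 1-15 rating of how likely the exploit is to succeed unnoticed."""
    return 7


def implement_backdoor(problem: str, idea: str) -> str:
    """Stand-in for producing code that is subtly wrong on a rare input."""
    return "# solution with a subtle, rarely triggered bug"


def attack_policy(problem: str) -> str:
    """Return the code the red team would submit for this problem."""
    if rate_difficulty(problem) > DIFFICULTY_THRESHOLD:
        return honest_solution(problem)   # too hard to backdoor reliably
    ideas = propose_backdoor_ideas(problem, NUM_BACKDOOR_IDEAS)
    best_idea = max(ideas, key=lambda idea: rate_idea(problem, idea))
    if rate_idea(problem, best_idea) < ATTEMPT_THRESHOLD:
        return honest_solution(problem)   # no idea looks promising enough
    return implement_backdoor(problem, best_idea)
```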