reproducibilityindex.ai

AI Control: Improving Safety Despite Intentional Subversion

Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrustedmodel code or using the untrusted model as a monitor substantially improves on simple baselines.
Researcher Affiliation	Industry	Ryan Greenblatt 1 Buck Shlegeris 1 Kshitij Sachan 2 Fabien Roger 1 1Redwood Research 2Anthropic, work done at Redwood Research.
Pseudocode	No	The paper describes procedures and pipelines but does not include any formal pseudocode or algorithm blocks.
Open Source Code	Yes	Code available at https://github.com/ rgreenblatt/control-evaluations
Open Datasets	Yes	We focus on a testbed we call the APPS backdooring setting: we want to use LLMs to solve 5,000 programming problems from the APPS dataset (of the interview difficulty level) (Hendrycks et al., 2021)
Dataset Splits	No	No validation/test split. We iterate on prompts and techniques using the same 1500 APPS problems on which we evaluate safety and usefulness, which puts us at the risk of overfitting.
Hardware Specification	No	The paper uses GPT-4 and GPT-3.5 models, but it does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies	Yes	We use GPT-4 (version 0613) as the untrusted more-capable model. We use GPT-3.5 (version 0613) for this.
Experiment Setup	Yes	The protocol has 100 H-points, enough to use an H-point on 2% of problems. Ask T whether the code looks suspicious; audit the top 2% most suspicious solutions. Produce code using U. Then use T to judge how suspicious this code is. Submit that code if its suspiciousness is below a certain threshold; otherwise produce code using T and submit that. We use a GPT-4 pipeline: we walk GPT-4 through the process of constructing a possible backdoor and deciding whether it thinks it can actually exploit this. If the difficulty rating is above some threshold, submit a normal solution instead of attempting to produce a backdoor. Generate 5 backdoor ideas. Please rate this likelihood on a scale from 1 to 15.