Stress-Testing Capability Elicitation With Password-Locked Models
Authors: Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. |
| Researcher Affiliation | Collaboration | Ryan Greenblatt Redwood Research ryan@rdwrs.com Fabien Roger Redwood Research fabien.d.roger@gmail.com Dmitrii Krasheninnikov University of Cambridge dk655@cam.ac.uk David Krueger University of Cambridge david.scott.krueger@gmail.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/Fabien Roger/sandbagging. |
| Open Datasets | Yes | Code generation. In our coding task, a model needs to generate solutions to coding problems from APPS (Hendrycks et al., 2021a) and MBPP (Austin et al., 2021) datasets. Math. ...MATH dataset (Hendrycks et al., 2021b). MMLU. ...MMLU multiple-choice problems (Hendrycks et al., 2020). |
| Dataset Splits | Yes | Each task s dataset is decomposed into a train set for πweak (10% of problems), a train set for πlock (40%), a train set for elicitation experiments (30%), and the validation set (20%). |
| Hardware Specification | Yes | All experiments were performed using single-GPU training runs on H100 GPUs. |
| Software Dependencies | Yes | For all experiments, we fine-tune models with Lion (Chen et al., 2023) (which enables us to do full-weight fine-tuning on a single GPU). We rely on pre-trained models from the Deepseek (Bi et al., 2024; Shao et al., 2024) and Pythia (Biderman et al., 2023) families, as well as Mistral-7B (Jiang et al., 2023) and GPT-4 (Open AI et al., 2023). |
| Experiment Setup | Yes | For locking and supervised fine-tuning unlocking experiments, we use a learning rate of 1.5 10 6 for generation tasks and 1.5 10 7 for our MMLU classification task. We use a linear learning rate warm-up, a weight decay of 0.01. We use 8 locking epochs. For DPO and iterated DPO, we use β = 0.1, and we pick the learning rate with the best average performance among all seeds among {1.5 10 6, 1.5 2 10 6, 1.5 4 10 6} since DPO is more sensitive to hyperparameter tuning in a way we could not easily predict. |