reproducibilityindex.ai

Stress-Testing Capability Elicitation With Password-Locked Models

Authors: Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities.
Researcher Affiliation	Collaboration	Ryan Greenblatt Redwood Research ryan@rdwrs.com Fabien Roger Redwood Research fabien.d.roger@gmail.com Dmitrii Krasheninnikov University of Cambridge dk655@cam.ac.uk David Krueger University of Cambridge david.scott.krueger@gmail.com
Pseudocode	No	The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code	Yes	Code available at https://github.com/Fabien Roger/sandbagging.
Open Datasets	Yes	Code generation. In our coding task, a model needs to generate solutions to coding problems from APPS (Hendrycks et al., 2021a) and MBPP (Austin et al., 2021) datasets. Math. ...MATH dataset (Hendrycks et al., 2021b). MMLU. ...MMLU multiple-choice problems (Hendrycks et al., 2020).
Dataset Splits	Yes	Each task s dataset is decomposed into a train set for πweak (10% of problems), a train set for πlock (40%), a train set for elicitation experiments (30%), and the validation set (20%).
Hardware Specification	Yes	All experiments were performed using single-GPU training runs on H100 GPUs.
Software Dependencies	Yes	For all experiments, we fine-tune models with Lion (Chen et al., 2023) (which enables us to do full-weight fine-tuning on a single GPU). We rely on pre-trained models from the Deepseek (Bi et al., 2024; Shao et al., 2024) and Pythia (Biderman et al., 2023) families, as well as Mistral-7B (Jiang et al., 2023) and GPT-4 (Open AI et al., 2023).
Experiment Setup	Yes	For locking and supervised fine-tuning unlocking experiments, we use a learning rate of 1.5 10 6 for generation tasks and 1.5 10 7 for our MMLU classification task. We use a linear learning rate warm-up, a weight decay of 0.01. We use 8 locking epochs. For DPO and iterated DPO, we use β = 0.1, and we pick the learning rate with the best average performance among all seeds among {1.5 10 6, 1.5 2 10 6, 1.5 4 10 6} since DPO is more sensitive to hyperparameter tuning in a way we could not easily predict.