Refusal in Language Models Is Mediated by a Single Direction

Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.
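As a concrete illustration of the two interventions described above, the sketch below removes or adds a refusal direction in residual-stream activations. It assumes a precomputed unit vector and cached activations; the function names, tensor shapes, and the difference-in-means construction shown here are illustrative, not the authors' exact implementation.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-in-means direction between harmful and harmless activations
    # (each tensor assumed to be [n_samples, d_model]), normalized to unit length.
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()

def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Erase the direction: x' = x - r_hat (r_hat . x), applied to every
    # residual-stream activation vector x.
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def add_direction(x: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    # Induce refusal: shift activations along the direction, x' = x + alpha * r_hat.
    return x + alpha * r_hat
```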
Researcher Affiliation | Collaboration | Andy Arditi (Independent); Oscar Obeso (ETH Zürich); Aaquib Syed (University of Maryland); Daniel Paleka (ETH Zürich); Nina Panickssery (Anthropic); Wes Gurnee (MIT); Neel Nanda
Pseudocode | No | The paper describes methods and processes using mathematical formulas and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | *Correspondence to andyrdt@gmail.com, obalcells@student.ethz.ch. Code available at https://github.com/andyrdt/refusal_direction.
Open Datasets | Yes | We construct two datasets: D_harmful, a dataset of harmful instructions drawn from ADVBENCH (Zou et al., 2023b), MALICIOUSINSTRUCT (Huang et al., 2023), TDC2023 (Mazeika et al., 2023, 2024), and HARMBENCH (Mazeika et al., 2024); and D_harmless, a dataset of harmless instructions sampled from ALPACA (Taori et al., 2023).
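A minimal sketch of how the two instruction pools might be assembled, assuming local JSON copies of the source datasets with an "instruction" field; the file names, the field name, and the harmless sample size are placeholders, not the authors' actual data layout.

```python
import json
import random

def load_instructions(path: str) -> list[str]:
    # Each file is assumed to be a JSON list of records with an "instruction" field.
    with open(path) as f:
        return [row["instruction"] for row in json.load(f)]

# Harmful pool: union of AdvBench, MaliciousInstruct, TDC2023, and HarmBench instructions.
harmful_files = ["advbench.json", "maliciousinstruct.json", "tdc2023.json", "harmbench.json"]
d_harmful = [inst for path in harmful_files for inst in load_instructions(path)]

# Harmless pool: instructions sampled from Alpaca (sample size here is arbitrary).
random.seed(0)
d_harmless = random.sample(load_instructions("alpaca.json"), k=512)
```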
Dataset Splits | Yes | Each dataset consists of train and validation splits of 128 and 32 samples, respectively. We apply filtering to ensure that the train and validation splits do not overlap with the evaluation datasets used in Sections 3 and 4.
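The split sizes above translate into a simple procedure like the following; the de-duplication criterion (string equality against the evaluation sets) is an assumption, since the report only states that the splits were filtered not to overlap with them.

```python
import random

def split_and_filter(pool: list[str], eval_instructions: set[str],
                     n_train: int = 128, n_val: int = 32, seed: int = 0):
    # Drop any instruction that also appears in the held-out evaluation
    # datasets, shuffle, then take 128 train and 32 validation samples.
    filtered = [inst for inst in pool if inst not in eval_instructions]
    random.Random(seed).shuffle(filtered)
    return filtered[:n_train], filtered[n_train:n_train + n_val]
```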
Hardware Specification | Yes | Most experiments presented in this paper were run on a cluster of eight NVIDIA RTX A6000 GPUs with 48GB of memory. All experiments on models with at most 14B parameters are run using a single 48GB memory GPU. For larger models, we use four 48GB memory GPUs in parallel.
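On the software side, a setup like the one below would realize this split: a model of at most 14B parameters fits on one 48GB GPU, while device_map="auto" (via Hugging Face Accelerate) shards a larger checkpoint across the four GPUs. The model name is only an example of a 72B-scale chat model, not a statement about the authors' exact model list.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-72B-Chat"  # example 72B-scale chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard layers across all visible GPUs, e.g. four 48GB cards
)
```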
Software Dependencies | No | For our exploratory research, we used TransformerLens (Nanda and Bloom, 2022). For our experimental pipeline, we use Hugging Face Transformers (Wolf et al., 2020), PyTorch (Paszke et al., 2019), and vLLM (Kwon et al., 2023).
Experiment Setup | Yes | When generating model completions for evaluation, we always use greedy decoding and a maximum generation length of 512 tokens, as suggested in Mazeika et al. (2024). We use the default chat template for each model family. We then fine-tuned LLAMA-3 8B INSTRUCT on the constructed dataset, applying LoRA (Hu et al., 2021) with rank=16 and alpha=32 for 4 epochs.
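The quoted generation and fine-tuning settings correspond roughly to the sketch below. The decoding settings (greedy, 512 new tokens, default chat template) and the LoRA rank/alpha match the quoted values, while the example prompt, the training loop (omitted), and the LoRA target modules are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Evaluation-time generation: default chat template, greedy decoding, 512-token cap.
messages = [{"role": "user", "content": "Give three tips for staying healthy."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
completion = model.generate(input_ids, do_sample=False, max_new_tokens=512)

# LoRA configuration with the stated rank and alpha; training for 4 epochs
# would be handled by a separate Trainer loop (not shown).
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_config)
```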