Refusal in Language Models Is Mediated by a Single Direction

Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction.
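As a concrete illustration of the two interventions described above, the sketch below removes or adds a refusal direction in residual-stream activations. It assumes a precomputed unit vector and cached activations; the function names, tensor shapes, and the difference-in-means construction shown here are illustrative, not the authors' exact implementation.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference-in-means direction between harmful and harmless activations
    # (each tensor assumed to be [n_samples, d_model]), normalized to unit length.
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()

def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    # Erase the direction: x' = x - r_hat (r_hat . x), applied to every
    # residual-stream activation vector x.
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def add_direction(x: torch.Tensor, r_hat: torch.Tensor, alpha: float) -> torch.Tensor:
    # Induce refusal: shift activations along the direction, x' = x + alpha * r_hat.
    return x + alpha * r_hat
```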
Researcher Affiliation | Collaboration | Andy Arditi (Independent); Oscar Obeso (ETH Zürich); Aaquib Syed (University of Maryland); Daniel Paleka (ETH Zürich); Nina Panickssery (Anthropic); Wes Gurnee (MIT); Neel Nanda
Pseudocode | No | The paper describes methods and processes using mathematical formulas and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | *Correspondence to andyrdt@gmail.com, obalcells@student.ethz.ch. Code available at https://github.com/andyrdt/refusal_direction.
Open Datasets | Yes | We construct two datasets: D_harmful, a dataset of harmful instructions drawn from ADVBENCH (Zou et al., 2023b), MALICIOUSINSTRUCT (Huang et al., 2023), TDC2023 (Mazeika et al., 2023, 2024), and HARMBENCH (Mazeika et al., 2024); and D_harmless, a dataset of harmless instructions sampled from ALPACA (Taori et al., 2023).
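A minimal sketch of how the two instruction pools might be assembled, assuming local JSON copies of the source datasets with an "instruction" field; the file names, the field name, and the harmless sample size are placeholders, not the authors' actual data layout.

```python
import json
import random

def load_instructions(path: str) -> list[str]:
    # Each file is assumed to be a JSON list of records with an "instruction" field.
    with open(path) as f:
        return [row["instruction"] for row in json.load(f)]

# Harmful pool: union of AdvBench, MaliciousInstruct, TDC2023, and HarmBench instructions.
harmful_files = ["advbench.json", "maliciousinstruct.json", "tdc2023.json", "harmbench.json"]
d_harmful = [inst for path in harmful_files for inst in load_instructions(path)]

# Harmless pool: instructions sampled from Alpaca (sample size here is arbitrary).
random.seed(0)
d_harmless = random.sample(load_instructions("alpaca.json"), k=512)
```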
Dataset Splits | Yes | Each dataset consists of train and validation splits of 128 and 32 samples, respectively. We apply filtering to ensure that the train and validation splits do not overlap with the evaluation datasets used in Sections 3 and 4.
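The split sizes above translate into a simple procedure like the following; the de-duplication criterion (string equality against the evaluation sets) is an assumption, since the report only states that the splits were filtered not to overlap with them.

```python
import random

def split_and_filter(pool: list[str], eval_instructions: set[str],
                     n_train: int = 128, n_val: int = 32, seed: int = 0):
    # Drop any instruction that also appears in the held-out evaluation
    # datasets, shuffle, then take 128 train and 32 validation samples.
    filtered = [inst for inst in pool if inst not in eval_instructions]
    random.Random(seed).shuffle(filtered)
    return filtered[:n_train], filtered[n_train:n_train + n_val]
```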
Hardware Specification | Yes | Most experiments presented in this paper were run on a cluster of eight NVIDIA RTX A6000 GPUs with 48GB of memory. All experiments on models with at most 14B parameters are run using a single 48GB memory GPU. For larger models, we use four 48GB memory GPUs in parallel.
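On the software side, a setup like the one below would realize this split: a model of at most 14B parameters fits on one 48GB GPU, while device_map="auto" (via Hugging Face Accelerate) shards a larger checkpoint across the four GPUs. The model name is only an example of a 72B-scale chat model, not a statement about the authors' exact model list.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-72B-Chat"  # example 72B-scale chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard layers across all visible GPUs, e.g. four 48GB cards
)
```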
Software Dependencies | No | For our exploratory research, we used TransformerLens (Nanda and Bloom, 2022). For our experimental pipeline, we use Hugging Face Transformers (Wolf et al., 2020), PyTorch (Paszke et al., 2019), and vLLM (Kwon et al., 2023).
Experiment Setup | Yes | When generating model completions for evaluation, we always use greedy decoding and a maximum generation length of 512 tokens, as suggested in Mazeika et al. (2024). We use the default chat template for each model family. We then fine-tuned LLAMA-3 8B INSTRUCT on the constructed dataset, applying LoRA (Hu et al., 2021) with rank=16 and alpha=32 for 4 epochs.
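The quoted generation and fine-tuning settings correspond roughly to the sketch below. The decoding settings (greedy, 512 new tokens, default chat template) and the LoRA rank/alpha match the quoted values, while the example prompt, the training loop (omitted), and the LoRA target modules are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Evaluation-time generation: default chat template, greedy decoding, 512-token cap.
messages = [{"role": "user", "content": "Give three tips for staying healthy."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
completion = model.generate(input_ids, do_sample=False, max_new_tokens=512)

# LoRA configuration with the stated rank and alpha; training for 4 epochs
# would be handled by a separate Trainer loop (not shown).
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_config)
```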