What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Authors: Samyak Jain, Ekdeep S Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, Puneet Dokania

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., 'design') versus the specific concepts the task is asked to be performed upon (e.g., a 'cycle' vs. a 'bomb'). Using this, we investigate three well-known safety fine-tuning methods (supervised safety fine-tuning, direct preference optimization, and unlearning) and provide significant evidence demonstrating that these methods minimally transform MLP weights to specifically align unsafe inputs into its weights' null space. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations are closer to safer samples, leading to the model processing such an input as if it were safe." (See the null-space diagnostic sketch after this table.)
Researcher Affiliation | Collaboration | Samyak Jain (Five AI Ltd.); Ekdeep Singh Lubana (University of Michigan & CBS, Harvard University); Kemal Oksuz (Five AI Ltd.); Tom Joy (Five AI Ltd.); Philip H.S. Torr (University of Oxford); Amartya Sanyal (Max Planck Institute for Intelligent Systems & University of Copenhagen); Puneet K. Dokania (Five AI Ltd. & University of Oxford)
Pseudocode | No | The paper describes processes such as data generation and experimental setups, but it does not present them in pseudocode or explicitly labeled algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/fiveai/understanding_safety_finetuning.
Open Datasets | No | "To systematically study the mechanisms yielded by safety fine-tuning and how adversarially designed inputs circumvent said mechanisms, we design a synthetic data generating process motivated by the framework of jailbreak attacks proposed by Wei et al. (2023) and Carlini et al. (2023)." and "We generate around 50 prompts corresponding to safe and unsafe samples manually and later augment the corresponding sets with the help of GPT-4 (Achiam et al., 2023) to generate a dataset containing 500 samples corresponding to safe and unsafe prompts each." The paper describes a synthetic data generation framework (see the PCFG sampling sketch after this table) and a manually created, GPT-4-augmented prompt dataset, but it provides no public access link or official name for this specific dataset.
Dataset Splits | No | "We generate around 50 prompts corresponding to safe and unsafe samples manually and later augment the corresponding sets with the help of GPT-4 (Achiam et al., 2023) to generate a dataset containing 500 samples corresponding to safe and unsafe prompts each. We make an evaluation subset of 100 samples from this." and "We use 1K samples randomly sampled independently from the PCFG tree for generating the test set. By manipulating the sampling process of text and task tokens as described earlier, we generate the test sets of jailbreak samples as well. Each of these sets contain 1K samples." The paper specifies a 1K-sample test set for the synthetic data and a 100-sample evaluation subset for Llama, but it does not specify full train/validation/test splits, only the test/evaluation sizes.
Hardware Specification | Yes | "This stage of combined pre-training and instruction fine-tuning takes over 8 hours on a single RTX A6000 gpu with 48GB memory, on using a batch size of 512." and "For all interpretability experiments we use a RTX A4500 gpu with a memory of 20 GB."
Software Dependencies | No | The paper mentions using 'mingpt models' and 'Llama models' but does not specify versions for software dependencies such as Python, PyTorch, TensorFlow, CUDA, or the specific libraries used for implementation.
Experiment Setup | Yes | "We use a cosine schedule on learning rate to ensure that a large learning rate is used for pre-training where majority of training focuses on learning the PCFG structure and a small value of learning rate is used for instruction fine-tuning where the major focus is to learn the bijective mappings. We decay the learning rate to 1e-6. We use 100k iterations to perform this training, with a learning rate of 1e-3 and cosine schedule with warmup of 10k iterations." and "We perform safety fine-tuning for 10k iterations, using cosine schedule without warmup with two sets of learning rates: 1e-4 and 1e-5 and decay them to 1e-7. We refer to 1e-4 as η_M and 1e-5 as η_S." and "We list the optimal values of hyperparameters below: Unlearning (η_M): γ = 0.1; Unlearning (η_S): γ = 0.01; DPO (η_M): β = 0.1, γ = 0.01; DPO (η_S): β = 0.1, γ = 0.002" (See the learning-rate schedule sketch after this table.)
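
As a complement to the Research Type row: the paper's central mechanistic claim is that safety fine-tuning minimally edits MLP weights so that unsafe activations fall into the (approximate) null space of the updated weights. The sketch below is not the authors' code; it assumes access to the same MLP projection matrix before and after safety fine-tuning (`W_base`, `W_safe`) and to batches of residual-stream activations for safe and unsafe prompts (`acts_safe`, `acts_unsafe`), all of which are placeholder names.

```python
import torch

def null_space_diagnostic(W_base, W_safe, acts_safe, acts_unsafe):
    """Rough diagnostic for the paper's claim: after safety fine-tuning,
    unsafe activations should be attenuated by the MLP weights (i.e. lie
    closer to the null space of W_safe) while safe activations are not."""
    def gain(W, acts):
        # Average amplification ||W x|| / ||x|| over a batch of activations.
        return ((acts @ W.T).norm(dim=-1) / acts.norm(dim=-1)).mean().item()

    return {
        "update_norm": (W_safe - W_base).norm().item(),   # how minimal the edit is
        "safe_gain_before":   gain(W_base, acts_safe),
        "safe_gain_after":    gain(W_safe, acts_safe),
        "unsafe_gain_before": gain(W_base, acts_unsafe),
        "unsafe_gain_after":  gain(W_safe, acts_unsafe),  # expected to drop the most
    }

# Hypothetical usage with random stand-ins for real weights and activations.
d_in, d_out = 256, 1024
W_base = torch.randn(d_out, d_in)
W_safe = W_base + 0.01 * torch.randn(d_out, d_in)
print(null_space_diagnostic(
    W_base, W_safe,
    acts_safe=torch.randn(64, d_in),
    acts_unsafe=torch.randn(64, d_in)))
```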
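The Open Datasets and Dataset Splits rows refer to synthetic samples drawn from a PCFG, with test sets built from 1K independent draws. The toy grammar and sampler below only illustrate what PCFG sampling looks like; the symbols, probabilities, and rules are invented placeholders, not the grammar used in the paper.

```python
import random

# Toy probabilistic context-free grammar: each non-terminal maps to a list of
# (probability, expansion) pairs. These symbols are placeholders only.
PCFG = {
    "S":    [(1.0, ["TASK", "TEXT"])],
    "TASK": [(0.5, ["op_safe"]), (0.5, ["op_unsafe"])],
    "TEXT": [(0.6, ["NP", "TEXT"]), (0.4, ["NP"])],
    "NP":   [(0.5, ["tok_a"]), (0.5, ["tok_b"])],
}

def sample(symbol="S"):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in PCFG:                      # terminal symbol
        return [symbol]
    probs, expansions = zip(*PCFG[symbol])
    expansion = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for child in expansion for tok in sample(child)]

# E.g. draw a 1K-sample test set, mirroring the "1K samples randomly sampled
# independently from the PCFG tree" described in the Dataset Splits row.
test_set = [sample() for _ in range(1000)]
print(test_set[0])
```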
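The Experiment Setup row describes cosine learning-rate schedules: 100k pre-training/instruction fine-tuning iterations with a 10k-iteration warmup, peak 1e-3 decayed to 1e-6, and 10k safety fine-tuning iterations without warmup at 1e-4 (η_M) or 1e-5 (η_S) decayed to 1e-7. A minimal sketch of such a schedule, assuming linear warmup and a standard cosine decay (the function below is illustrative, not the authors' implementation):

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float, warmup_steps: int = 0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Pre-training / instruction fine-tuning: 100k steps, 10k warmup, 1e-3 -> 1e-6.
pretrain_lr = [cosine_lr(t, 100_000, 1e-3, 1e-6, warmup_steps=10_000)
               for t in range(100_000)]

# Safety fine-tuning (eta_M setting): 10k steps, no warmup, 1e-4 -> 1e-7.
safety_lr = [cosine_lr(t, 10_000, 1e-4, 1e-7) for t in range(10_000)]
print(pretrain_lr[0], pretrain_lr[-1], safety_lr[0], safety_lr[-1])
```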