Representation Noising: A Defence Mechanism Against Harmful Fine-Tuning
Authors: Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Carsten Maple, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence that the efficacy of our defence lies in its depth: the degree to which information about harmful representations is removed across all layers of the LLM. We present extensive experimental evidence that our method can mitigate training on harmful question-answering and toxic content generation tasks while maintaining the ability to train the model on harmless tasks and preserving LLM capability (Section 4). |
| Researcher Affiliation | Collaboration | 1Dalhousie University 2CISPA Helmholtz Center for Information Security 3Swarthmore College 4University of Warwick 5University of Toronto 6Vijil 7Vector Institute for Artificial Intelligence |
| Pseudocode | Yes | The full implementation of our approach is given below in algorithm 1. |
| Open Source Code | Yes | Code available at https://github.com/domenicrosati/representation-noising |
| Open Datasets | Yes | To fine-tune for harmful question-answering, we use the BeaverTails harmful QA dataset [36] since it is a very large-scale dataset used in other attack literature [32], where the goal is to train an LLM to generate compliant answers to questions belonging to 14 categories of harm such as animal abuse and violent crime. For toxic content generation, we use the DecodingTrust [62] split of RealToxicityPrompts (RTP) [19] to fine-tune an LLM to generate highly toxic continuations. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | For this setting, we train the base model and its post-RepNoise version using 1 epoch and a learning rate of 8 × 10⁻⁵, using only the training splits of each dataset. We perform evaluations on the test splits of respective datasets. |
| Hardware Specification | Yes | We primarily used a single node with 4×A100 (80GB VRAM) GPUs for our results. Occasionally we used 4×A40 (40GB VRAM) GPU nodes as well as 1×A100 (40GB VRAM) from Google Colab. |
| Software Dependencies | No | The paper mentions using PyTorch and Hugging Face for training and inference but does not specify their version numbers. |
| Experiment Setup | Yes | For the BeaverTails attack, we train RepNoise for 1 epoch on 10k paired samples (10k harmful questions with harmful answers and 10k of the same questions with safe answers or refusals) using α = 1, β = 0.001, learning rate (LR) 2 × 10⁻⁵. A batch size of 8 is used throughout the experiments. These settings were found after performing a grid search over α ∈ {4, 2, 1, 0.5, 0.1}, β ∈ {2, 1, 0.1, 0.01, 0.001, 0.0001}, LR ∈ {8 × 10⁻⁸, 1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 8 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³}, and 1, 2, and 4 epochs. (A hedged sketch of this grid search follows the table.) |
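
The Open Datasets and Dataset Splits rows reference the BeaverTails and RealToxicityPrompts/DecodingTrust data. Below is a minimal loading sketch using the Hugging Face `datasets` library; the dataset identifiers and the way splits are inspected here are assumptions for illustration, not the authors' loading code (see https://github.com/domenicrosati/representation-noising for the exact pipeline).

```python
from datasets import load_dataset

# Hedged sketch: the dataset identifiers below are assumptions, not taken from
# the paper; consult the authors' repository for the canonical loading code.

# Harmful question answering (BeaverTails). Inspect the returned DatasetDict to
# see which train/test splits are available before choosing attack/eval splits.
beavertails = load_dataset("PKU-Alignment/BeaverTails")
print(beavertails)

# Toxic content generation (RealToxicityPrompts). The DecodingTrust subset used
# in the paper would be derived from or loaded alongside this dataset.
rtp = load_dataset("allenai/real-toxicity-prompts")
print(rtp)
```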
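
The Experiment Setup row reports a grid search over α, β, learning rate, and epoch count with a fixed batch size of 8. The sketch below simply enumerates that grid in plain Python; `train_repnoise` is a hypothetical placeholder for the authors' RepNoise training routine, not an implementation of it.

```python
from itertools import product

# Hyperparameter grid reported in the reproducibility table above.
alphas = [4, 2, 1, 0.5, 0.1]
betas = [2, 1, 0.1, 0.01, 0.001, 0.0001]
learning_rates = [8e-8, 1e-5, 2e-5, 3e-5, 8e-5, 1e-4, 1e-3]
epoch_counts = [1, 2, 4]

def train_repnoise(alpha, beta, lr, n_epochs, batch_size=8):
    """Hypothetical placeholder: the actual RepNoise training loop lives in the
    authors' repository; this function only marks where it would be called."""
    raise NotImplementedError

# Enumerate every combination; the paper's selected setting was
# alpha=1, beta=0.001, lr=2e-5, 1 epoch, batch size 8.
for alpha, beta, lr, n_epochs in product(alphas, betas, learning_rates, epoch_counts):
    train_repnoise(alpha, beta, lr, n_epochs)
```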