Representation Noising: A Defence Mechanism Against Harmful Fine-Tuning
Authors: Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Carsten Maple, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence that the efficacy of our defence lies in its depth: the degree to which information about harmful representations is removed across all layers of the LLM. We present extensive experimental evidence that our method can mitigate training on harmful question-answering and toxic content generation tasks while maintaining the ability to train the model on harmless tasks and preserving LLM capability (Section 4). |
| Researcher Affiliation | Collaboration | 1Dalhousie University 2CISPA Helmholtz Center for Information Security 3Swarthmore College 4University of Warwick 5University of Toronto 6Vijil 7Vector Institute for Artificial Intelligence |
| Pseudocode | Yes | The full implementation of our approach is given below in algorithm 1. |
| Open Source Code | Yes | Code available at https://github.com/domenicrosati/representation-noising |
| Open Datasets | Yes | To fine-tune for harmful question-answering, we use the BeaverTails harmful QA dataset [36] since it is a very large-scale dataset used in other attack literature [32], where the goal is to train an LLM to generate compliant answers to questions belonging to 14 categories of harm such as animal abuse and violent crime. For toxic content generation, we use the DecodingTrust [62] split of RealToxicityPrompts (RTP) [19] to fine-tune an LLM to generate highly toxic continuations. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | For this setting, we train the base model and its post-RepNoise version using 1 epoch and a learning rate of 8 × 10⁻⁵, using only the training splits of each dataset. We perform evaluations on the test splits of respective datasets. |
| Hardware Specification | Yes | We primarily used a single node with 4×A100 (80GB VRAM) GPUs for our results. Occasionally we used 4×A40 (40GB VRAM) GPU nodes as well as 1×A100 (40GB VRAM) from Google Colab. |
| Software Dependencies | No | The paper mentions using PyTorch and Hugging Face for training and inference but does not specify their version numbers. |
| Experiment Setup | Yes | For the BeaverTails attack, we train RepNoise for 1 epoch on 10k paired samples (10k harmful questions with harmful answers and 10k of the same questions with safe answers or refusals) using α = 1, β = 0.001, learning rate (LR) 2 × 10⁻⁵. A batch size of 8 is used throughout the experiments. These settings were found after performing a grid search over α ∈ {4, 2, 1, 0.5, 0.1}, β ∈ {2, 1, 0.1, 0.01, 0.001, 0.0001}, LR ∈ {8 × 10⁻⁸, 1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 8 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³}, and 1, 2, and 4 epochs. (A hedged sketch of this grid search follows the table.) |
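
The Open Datasets and Dataset Splits rows reference the BeaverTails and RealToxicityPrompts/DecodingTrust data. Below is a minimal loading sketch using the Hugging Face `datasets` library; the dataset identifiers and the way splits are inspected here are assumptions for illustration, not the authors' loading code (see https://github.com/domenicrosati/representation-noising for the exact pipeline).

```python
from datasets import load_dataset

# Hedged sketch: the dataset identifiers below are assumptions, not taken from
# the paper; consult the authors' repository for the canonical loading code.

# Harmful question answering (BeaverTails). Inspect the returned DatasetDict to
# see which train/test splits are available before choosing attack/eval splits.
beavertails = load_dataset("PKU-Alignment/BeaverTails")
print(beavertails)

# Toxic content generation (RealToxicityPrompts). The DecodingTrust subset used
# in the paper would be derived from or loaded alongside this dataset.
rtp = load_dataset("allenai/real-toxicity-prompts")
print(rtp)
```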
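
The Experiment Setup row reports a grid search over α, β, learning rate, and epoch count with a fixed batch size of 8. The sketch below simply enumerates that grid in plain Python; `train_repnoise` is a hypothetical placeholder for the authors' RepNoise training routine, not an implementation of it.

```python
from itertools import product

# Hyperparameter grid reported in the reproducibility table above.
alphas = [4, 2, 1, 0.5, 0.1]
betas = [2, 1, 0.1, 0.01, 0.001, 0.0001]
learning_rates = [8e-8, 1e-5, 2e-5, 3e-5, 8e-5, 1e-4, 1e-3]
epoch_counts = [1, 2, 4]

def train_repnoise(alpha, beta, lr, n_epochs, batch_size=8):
    """Hypothetical placeholder: the actual RepNoise training loop lives in the
    authors' repository; this function only marks where it would be called."""
    raise NotImplementedError

# Enumerate every combination; the paper's selected setting was
# alpha=1, beta=0.001, lr=2e-5, 1 epoch, batch size 8.
for alpha, beta, lr, n_epochs in product(alphas, betas, learning_rates, epoch_counts):
    train_repnoise(alpha, beta, lr, n_epochs)
```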