Learning and Forgetting Unsafe Examples in Large Language Models
Authors: Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the Forget Filter algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the Forget Filter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. |
| Researcher Affiliation | Collaboration | 1University of Massachusetts Amherst, 2Columbia University, 3Google, 4Stanford University, 5New York University. |
| Pseudocode | Yes | Algorithm 1 The Forget Filter algorithm (a minimal sketch of the filtering step follows this table). |
| Open Source Code | Yes | Code is available at https://github.com/andotalao24/learn-forget-unsafe-llm. |
| Open Datasets | Yes | We use three datasets, each representing a different notion of safety risk: bias, toxicity, and harmfulness. To study bias, we use the BBQ dataset (Parrish et al., 2022)... To study toxicity, we employ the dataset subsampled from the Pile (Gao et al., 2020) by Korbak et al. (2023)... We also experiment on examples from the HarmfulQA dataset (Bhardwaj & Poria, 2023)... Dtask contains question answering data, i.e. SQuAD (Rajpurkar et al., 2016), and instruction tuning data, i.e. Alpaca (Taori et al., 2023), representing useful downstream tasks. |
| Dataset Splits | No | For experiments in Section 2, we construct a noisy dataset of 5000 examples as is discussed in Section 2.1 and sample 7000 safe examples for Safety Finetuning. Bias or toxicity is evaluated on 5000 randomly sampled held-out data. The paper mentions data used for finetuning and held-out data for evaluation, but does not explicitly specify train/validation splits or percentages. |
| Hardware Specification | No | The paper mentions LLMs of different scales (LLaMA 7B, GPT2-XL, GPT2-L, GPT2-M) and uses LoRA for finetuning, but does not specify any concrete hardware details such as GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022) but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For experiments in Section 2, we construct a noisy dataset of 5000 examples as is discussed in Section 2.1 and sample 7000 safe examples for Safety Finetuning. Bias or toxicity is evaluated on 5000 randomly sampled held-out data. We set the learning rate as 2 × 10^-4 and the batch size as 32 to accommodate our computation resources. We use LoRA (Hu et al., 2022) by default to finetune the full LLaMA-7B unless otherwise specified in this paper (a finetuning-configuration sketch follows this table). |
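
The Pseudocode row above refers to Algorithm 1, the Forget Filter. The snippet below is only a minimal sketch of the underlying idea, not the authors' released implementation: it assumes the forgetting signal for an example is the increase in its loss after subsequent safety finetuning, and the names `per_example_loss`, `forget_filter`, and the `threshold` value are illustrative.

```python
import torch


def per_example_loss(model, tokenizer, texts, device="cpu"):
    """Average token-level cross-entropy loss of each text under the model."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return losses


def forget_filter(examples, losses_before, losses_after, threshold):
    """Split examples by forgetting signal: the increase in per-example loss
    after safety finetuning. Large increases are treated as likely unsafe."""
    kept, flagged = [], []
    for ex, before, after in zip(examples, losses_before, losses_after):
        forgetting = after - before  # larger increase => stronger forgetting
        (flagged if forgetting > threshold else kept).append(ex)
    return kept, flagged
```

In this sketch, `losses_before` would be measured right after finetuning on the noisy custom data and `losses_after` on the same model after additional finetuning on safe data; examples whose loss rises by more than the threshold are flagged as unsafe and dropped from the custom dataset.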
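The Experiment Setup row reports LoRA finetuning of LLaMA-7B with a learning rate of 2 × 10^-4 and a batch size of 32. The configuration below is a hedged sketch using the Hugging Face `transformers` and `peft` libraries; the checkpoint name, LoRA rank, alpha, target modules, dropout, and epoch count are assumptions, since the paper row does not state them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint name; the paper only says "LLaMA-7B".
base_model = "huggyllama/llama-7b"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapter; rank, alpha, target modules, and dropout are assumed
# defaults, not values reported in this reproducibility row.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Learning rate and batch size follow the reported setup (2e-4, 32).
training_args = TrainingArguments(
    output_dir="finetune-output",
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    num_train_epochs=1,  # assumed; not stated in this row
)
```

The resulting `model`, `tokenizer`, and `training_args` would then be passed to a `transformers.Trainer` together with the noisy custom data or the safe finetuning data, depending on the stage of the experiment.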