Learning and Forgetting Unsafe Examples in Large Language Models
Authors: Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the Forget Filter algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the Forget Filter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. |
| Researcher Affiliation | Collaboration | 1University of Massachusetts Amherst, 2Columbia University, 3Google, 4Stanford University, 5New York University. |
| Pseudocode | Yes | Algorithm 1 The Forget Filter algorithm (a minimal sketch of the filtering step follows this table). |
| Open Source Code | Yes | Code is available at https://github.com/andotalao24/learn-forget-unsafe-llm. |
| Open Datasets | Yes | We use three datasets, each representing a different notion of safety risk: bias, toxicity, and harmfulness. To study bias, we use the BBQ dataset (Parrish et al., 2022)... To study toxicity, we employ the dataset subsampled from the Pile (Gao et al., 2020) by Korbak et al. (2023)... We also experiment on examples from the HarmfulQA dataset (Bhardwaj & Poria, 2023)... Dtask contains question answering data, i.e. SQuAD (Rajpurkar et al., 2016), and instruction tuning data, i.e. Alpaca (Taori et al., 2023), representing useful downstream tasks. |
| Dataset Splits | No | For experiments in Section 2, we construct a noisy dataset of 5000 examples as is discussed in Section 2.1 and sample 7000 safe examples for Safety Finetuning. Bias or toxicity is evaluated on 5000 randomly sampled held-out data. The paper mentions data used for finetuning and held-out data for evaluation, but does not explicitly specify train/validation splits or percentages. |
| Hardware Specification | No | The paper mentions LLMs of different scales (LLaMA 7B, GPT2-XL, GPT2-L, GPT2-M) and uses LoRA for finetuning, but does not specify any concrete hardware details such as GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using LoRA (Hu et al., 2022) but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For experiments in Section 2, we construct a noisy dataset of 5000 examples as is discussed in Section 2.1 and sample 7000 safe examples for Safety Finetuning. Bias or toxicity is evaluated on 5000 randomly sampled held-out data. We set the learning rate as 2 × 10^-4 and the batch size as 32 to accommodate our computation resources. We use LoRA (Hu et al., 2022) by default to finetune the full LLaMA-7B unless otherwise specified in this paper (a finetuning-configuration sketch follows this table). |
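
The Pseudocode row above refers to Algorithm 1, the Forget Filter. The snippet below is only a minimal sketch of the underlying idea, not the authors' released implementation: it assumes the forgetting signal for an example is the increase in its loss after subsequent safety finetuning, and the names `per_example_loss`, `forget_filter`, and the `threshold` value are illustrative.

```python
import torch


def per_example_loss(model, tokenizer, texts, device="cpu"):
    """Average token-level cross-entropy loss of each text under the model."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return losses


def forget_filter(examples, losses_before, losses_after, threshold):
    """Split examples by forgetting signal: the increase in per-example loss
    after safety finetuning. Large increases are treated as likely unsafe."""
    kept, flagged = [], []
    for ex, before, after in zip(examples, losses_before, losses_after):
        forgetting = after - before  # larger increase => stronger forgetting
        (flagged if forgetting > threshold else kept).append(ex)
    return kept, flagged
```

In this sketch, `losses_before` would be measured right after finetuning on the noisy custom data and `losses_after` on the same model after additional finetuning on safe data; examples whose loss rises by more than the threshold are flagged as unsafe and dropped from the custom dataset.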
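The Experiment Setup row reports LoRA finetuning of LLaMA-7B with a learning rate of 2 × 10^-4 and a batch size of 32. The configuration below is a hedged sketch using the Hugging Face `transformers` and `peft` libraries; the checkpoint name, LoRA rank, alpha, target modules, dropout, and epoch count are assumptions, since the paper row does not state them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint name; the paper only says "LLaMA-7B".
base_model = "huggyllama/llama-7b"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapter; rank, alpha, target modules, and dropout are assumed
# defaults, not values reported in this reproducibility row.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Learning rate and batch size follow the reported setup (2e-4, 32).
training_args = TrainingArguments(
    output_dir="finetune-output",
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    num_train_epochs=1,  # assumed; not stated in this row
)
```

The resulting `model`, `tokenizer`, and `training_args` would then be passed to a `transformers.Trainer` together with the noisy custom data or the safe finetuning data, depending on the stage of the experiment.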