Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning

Authors: Jiadong Pan, Hongcheng Gao, Zongyu Wu, Taihang Hu, Li Su, Qingming Huang, Liang Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct comprehensive experiments to evaluate the effectiveness of our methods, aiming to answer the following research questions: (RQ1) Whether our method leveraging catastrophic forgetting can be used to achieve a safe model? (RQ2) Whether the safe model reinforced by our method can prevent malicious fine-tuning?
Researcher Affiliation | Academia | Jiadong Pan (1,2), Hongcheng Gao (2), Zongyu Wu (3), Taihang Hu (4), Li Su (2), Qingming Huang (1,2), Liang Li (1). Affiliations: 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, CAS; 2 University of Chinese Academy of Sciences; 3 The Pennsylvania State University; 4 Nankai University
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present.
Open Source Code | No | The paper does not provide an explicit link to open-source code for the methodology described in the paper, nor does it state that the code is provided in supplementary materials.
Open Datasets | Yes | Datasets. To provide a comprehensive evaluation of our method, we use prompts of LAION-5B [44] to generate clean images and harmful prompts generated by Mistral 7B [18] to create harmful images. ... In addition, we use Diffusion DB [48], COCO [26], I2P [42], and Unsafe [32] prompts to test the effectiveness of our model.
Dataset Splits | No | The paper mentions training and testing data but does not explicitly specify the percentages or counts for training/validation/test splits, nor does it refer to a predefined validation split.
Hardware Specification | Yes | All experiments are conducted on NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions using specific models like Stable Diffusion (SD) v1.4, SD v2.1, and Mistral 7B [18], but does not provide a list of specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | Configurations. Malicious fine-tuning steps of models are set to 20. All of the models are trained for 200 gradient update steps with a learning rate 1e-5 and a batch size of 1. λ, λ_c, and l are set to 5e-5, 1, and 0 in the training process.
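
The Experiment Setup row reports concrete hyperparameters. As an illustration only, the sketch below maps those numbers onto a generic PyTorch training loop; the `config` dict, the `train` function, the `loss_fn` placeholder, and the optimizer choice are all assumptions, since the paper releases no code and its catastrophic-forgetting objective (weighted by λ, λ_c, and l) is not reproduced here.

```python
# Minimal sketch of the reported training configuration (hypothetical names;
# the paper does not release code, so the model and loss are placeholders).
import itertools
import torch

config = {
    "malicious_finetune_steps": 20,  # attacker's fine-tuning steps (attack setting, not this loop)
    "train_steps": 200,              # gradient update steps for training the safe model
    "learning_rate": 1e-5,
    "batch_size": 1,
    "lambda": 5e-5,                  # λ in the paper
    "lambda_c": 1.0,                 # λ_c in the paper
    "l": 0,                          # l in the paper
}

def train(model: torch.nn.Module, dataloader, loss_fn):
    """Generic loop matching the reported step count, learning rate, and batch size.

    `loss_fn(model, batch)` stands in for the paper's objective, which is assumed
    to use the weights above; it is not reproduced here.
    """
    # Optimizer choice is an assumption; the paper does not specify it.
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["learning_rate"])
    model.train()
    for _, batch in zip(range(config["train_steps"]), itertools.cycle(dataloader)):
        loss = loss_fn(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Note that the 20 malicious fine-tuning steps describe the attack being defended against, not the defense training loop sketched above.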