Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning
Authors: Jiadong Pan, Hongcheng Gao, Zongyu Wu, Taihang Hu, Li Su, Qingming Huang, Liang Li
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct comprehensive experiments to evaluate the effectiveness of our methods, aiming to answer the following research questions: (RQ1) Can our method, which leverages catastrophic forgetting, be used to achieve a safe model? (RQ2) Can the safe model reinforced by our method prevent malicious fine-tuning? |
| Researcher Affiliation | Academia | Jiadong Pan (1,2), Hongcheng Gao (2), Zongyu Wu (3), Taihang Hu (4), Li Su (2), Qingming Huang (1,2), Liang Li (1). (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, CAS; (2) University of Chinese Academy of Sciences; (3) The Pennsylvania State University; (4) Nankai University |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | No | The paper does not provide an explicit link to open-source code for the methodology described in the paper, nor does it state that the code is provided in supplementary materials. |
| Open Datasets | Yes | Datasets. To provide a comprehensive evaluation of our method, we use prompts from LAION-5B [44] to generate clean images and harmful prompts generated by Mistral 7B [18] to create harmful images. ... In addition, we use DiffusionDB [48], COCO [26], I2P [42], and Unsafe [32] prompts to test the effectiveness of our model. |
| Dataset Splits | No | The paper mentions training and testing data but does not explicitly specify the percentages or counts for training/validation/test splits, nor does it refer to a predefined validation split. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Stable Diffusion (SD) v1.4 and SD v2.1, and Mistral 7B [18], but does not provide a list of specific software dependencies with version numbers (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | Configurations. Malicious fine-tuning steps of models are set to 20. All of the models are trained for 200 gradient update steps with a learning rate of 1e-5 and a batch size of 1. λ, λ_c, and l are set to 5e-5, 1, and 0 in the training process. (A hedged reconstruction of this configuration is sketched below the table.) |
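Since the paper does not release code (see the Open Source Code row), the snippet below is only a hypothetical sketch of a Stable Diffusion v1.4 fine-tuning loop matching the reported settings (200 gradient update steps, learning rate 1e-5, batch size 1, RTX 3090). The `train_loader` and the choice of prompt/image data are assumed placeholders, and the paper's λ, λ_c, and l terms are omitted because their loss is not defined in this section; only the standard denoising MSE objective is shown.

```python
# Hypothetical sketch (not the authors' code) of a fine-tuning loop using the
# reported configuration: 200 gradient update steps, lr 1e-5, batch size 1.
import torch
from diffusers import StableDiffusionPipeline, DDPMScheduler

device = "cuda"  # experiments reportedly ran on NVIDIA RTX 3090 GPUs
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
unet, vae = pipe.unet, pipe.vae
text_encoder, tokenizer = pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # reported learning rate
num_steps = 200                                            # reported update steps

unet.train()
for step, (images, prompts) in enumerate(train_loader):    # train_loader: hypothetical, batch_size=1
    if step >= num_steps:
        break
    images = images.to(device)

    # Encode images to latents and add noise at a random timestep.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Encode prompts and predict the added noise.
    tokens = tokenizer(
        list(prompts), padding="max_length", truncation=True,
        max_length=tokenizer.model_max_length, return_tensors="pt",
    ).to(device)
    text_emb = text_encoder(tokens.input_ids)[0]
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

    # Standard denoising MSE objective; the paper's additional terms
    # (weighted by λ, λ_c, and l) would be added on top of this loss.
    loss = torch.nn.functional.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```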