Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

Authors: Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available here.
Researcher Affiliation | Academia | ¹Nanyang Technological University, Singapore; ²Central South University, China; ³Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China; ⁴State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Pseudocode | Yes | Algorithm 1: GETHYPERGRAD; Algorithm 2: ResAlign
Open Source Code | Yes | Our code and pretrained models are publicly available here. (...) Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Please refer to our code repository for details.
Open Datasets | Yes | Our code and pretrained models are publicly available here. Disclaimer: This paper includes AI-generated images containing partially nude human figures and other sensitive content, shown only for research purposes. (...) Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All datasets used in this paper (e.g., DiffusionDB and DreamBench++) are open-source and publicly available.
Dataset Splits | Yes | We used three datasets for training ResAlign, each serving a distinct objective. First, the harmful dataset D_harmful is used to compute the harmful loss L_harmful. We constructed this dataset by selecting 150 unsafe prompts from existing unsafe datasets (e.g., NSFW-56k [43]). Second, the preservation dataset D_preserve is used to compute the regularization term R(θ). For this dataset, we selected 140 benign prompts from COCO-Objects and CelebA-HQ. Finally, to simulate downstream fine-tuning data, we constructed the fine-tuning simulation dataset D_FT by randomly sampling 100 prompts from a prompt pool selected from DiffusionDB, NSFW-56k, and D_harmful. (...) To ensure reproducibility and reduce computational overhead, we follow Zhang et al. [83] and adopt their publicly released subset of 10,000 randomly sampled text-image pairs from COCO for evaluation. (...) In our main paper, we focus on the sexual category within I2P, selecting all 931 relevant prompts. (...) filter the dataset by generating images for each prompt and select a subset of 200 prompts that have the highest averaged sexually unsafe probability as rated by MHSC.
Hardware Specification | Yes | Training is performed on a single NVIDIA RTX A100 GPU until convergence, which typically requires 1 GPU hour. During training, the peak and average memory consumption are 56 GB and 24 GB, respectively.
Software Dependencies | No | Our fine-tuning is based on the official script provided by diffusers¹³, whose default configuration is full-parameter fine-tuning of the UNet parameters with a learning rate of 1 × 10⁻⁵, a batch size of 1, and 200 training steps, using the AdamW [50] optimizer with default hyperparameters.
Experiment Setup | Yes | For our meta-learning, the distribution of configurations π(C) includes learning rates of [1 × 10⁻⁴, 1 × 10⁻⁵, 1 × 10⁻⁶], steps of [5, 10, 20, 30], the fine-tuning loss (i.e., L_FT) being either the standard denoising loss or the prior-preserved denoising loss [69], the algorithm being either full-parameter fine-tuning or LoRA [24], and the optimizer being either SGD or Adam. (...) The outer loop learning rate is set to 2 × 10⁻⁴. (...) Our fine-tuning is based on the official script provided by diffusers¹³, whose default configuration is full-parameter fine-tuning of the UNet parameters with a learning rate of 1 × 10⁻⁵, a batch size of 1, and 200 training steps, using the AdamW [50] optimizer with default hyperparameters.
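The Pseudocode row names Algorithm 1, GETHYPERGRAD, and the Experiment Setup row describes an outer loop wrapped around simulated fine-tuning. The algorithm itself is not reproduced here, so the following is only a minimal scalar sketch of the general idea of an unrolled hypergradient, under toy assumptions: a quadratic fine-tuning loss L_FT(θ) = ½·a·(θ − b)², a quadratic stand-in for the outer harmful loss, and plain gradient descent as the inner loop. All function names are hypothetical; this is not the authors' implementation.

```python
def inner_finetune(theta0, lr, steps, a, b):
    """Run `steps` of gradient descent on the toy fine-tuning loss
    L_FT(theta) = 0.5 * a * (theta - b) ** 2 and return the final parameter."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * a * (theta - b)  # dL_FT/dtheta = a * (theta - b)
    return theta


def harmful_loss(theta, c):
    """Toy outer objective standing in for L_harmful: 0.5 * (theta - c) ** 2."""
    return 0.5 * (theta - c) ** 2


def outer_objective(theta0, lr, steps, a, b, c):
    """Outer loss evaluated *after* simulated fine-tuning from theta0."""
    return harmful_loss(inner_finetune(theta0, lr, steps, a, b), c)


def get_hypergrad(theta0, lr, steps, a, b, c):
    """Gradient of outer_objective w.r.t. theta0 via the chain rule.

    Each inner step theta <- theta - lr * a * (theta - b) has derivative
    (1 - lr * a), so d theta_K / d theta_0 = (1 - lr * a) ** steps.
    """
    theta_k = inner_finetune(theta0, lr, steps, a, b)
    outer_grad = theta_k - c  # dL_harmful/dtheta at the fine-tuned point
    return outer_grad * (1 - lr * a) ** steps


if __name__ == "__main__":
    cfg = dict(lr=0.1, steps=10, a=2.0, b=1.0, c=-3.0)
    analytic = get_hypergrad(0.5, **cfg)
    eps = 1e-6
    numeric = (outer_objective(0.5 + eps, **cfg)
               - outer_objective(0.5 - eps, **cfg)) / (2 * eps)
    print(analytic, numeric)  # the two estimates should agree closely
```

Because every inner step is affine in θ in this toy, the Jacobian of K steps collapses to (1 − lr·a)^K, and the central finite difference confirms the chain-rule result. A version over real network weights would typically rely on automatic differentiation rather than a closed form.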
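The Dataset Splits row describes two selection steps: keeping the 200 I2P prompts with the highest averaged unsafe probability (as rated by MHSC) and drawing 100 prompts at random from a pool to form D_FT. A minimal sketch of both, assuming the per-image probabilities have already been computed and are passed in as a plain dictionary; the classifier call, the number of images per prompt, and the function names are hypothetical.

```python
import random
from statistics import mean


def select_most_unsafe(prompt_scores, k=200):
    """Return the k prompts with the highest mean unsafe probability.

    prompt_scores maps each prompt to the list of unsafe probabilities of the
    images generated for it (computed elsewhere by some image classifier).
    """
    ranked = sorted(prompt_scores,
                    key=lambda p: mean(prompt_scores[p]),
                    reverse=True)
    return ranked[:k]


def sample_finetune_sim(pool, n=100, seed=0):
    """Draw n distinct prompts uniformly at random from a prompt pool.

    A fixed seed makes the draw repeatable across runs.
    """
    rng = random.Random(seed)
    return rng.sample(pool, n)


if __name__ == "__main__":
    scores = {"p1": [0.9, 0.7], "p2": [0.1, 0.2], "p3": [0.6, 0.95]}
    print(select_most_unsafe(scores, k=2))  # ['p1', 'p3']
    pool = [f"prompt_{i}" for i in range(500)]
    print(len(sample_finetune_sim(pool, n=100)))  # 100
```

Ranking by the mean over several generated images, rather than a single image, reduces the chance that one unusually unsafe or safe sample decides whether a prompt is kept.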
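The Software Dependencies row reports the default downstream fine-tuning setup. Collected into one object it might look like the sketch below; the class and field names are illustrative assumptions and are not claimed to match the diffusers script's actual arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Default downstream fine-tuning setup as reported in the row above.
    Field names are hypothetical, not the diffusers script's own flags."""
    trainable: str = "unet"       # full-parameter fine-tuning of the UNet
    learning_rate: float = 1e-5
    batch_size: int = 1
    max_steps: int = 200
    optimizer: str = "AdamW"      # library-default hyperparameters

    def samples_seen(self) -> int:
        """Number of training examples processed over the whole run."""
        return self.batch_size * self.max_steps


if __name__ == "__main__":
    cfg = FineTuneConfig()
    print(cfg.samples_seen())  # 200
```

With a batch size of 1 and 200 steps, each default fine-tuning run processes only 200 examples.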
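The Experiment Setup row lists the discrete choices that make up π(C). A minimal sketch of drawing one configuration per outer-loop iteration, assuming each factor is sampled independently and uniformly (the row does not state the weighting) and using hypothetical string labels for the factor values:

```python
import itertools
import random

# Discrete choices reported for the configuration distribution pi(C).
CONFIG_SPACE = {
    "learning_rate": [1e-4, 1e-5, 1e-6],
    "steps": [5, 10, 20, 30],
    "ft_loss": ["denoising", "prior_preserving_denoising"],
    "algorithm": ["full_parameter", "lora"],
    "optimizer": ["sgd", "adam"],
}


def sample_config(rng: random.Random) -> dict:
    """Draw one fine-tuning configuration, choosing each factor
    independently and uniformly (an assumed weighting)."""
    return {name: rng.choice(values) for name, values in CONFIG_SPACE.items()}


def all_configs() -> list:
    """Enumerate the full Cartesian product of the configuration space."""
    names = list(CONFIG_SPACE)
    return [dict(zip(names, combo))
            for combo in itertools.product(*CONFIG_SPACE.values())]


if __name__ == "__main__":
    print(len(all_configs()))  # 96
    print(sample_config(random.Random(0)))
```

Sampling each factor independently and uniformly is equivalent to drawing uniformly from the 3 × 4 × 2 × 2 × 2 = 96 combinations that all_configs() enumerates.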