Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models
Authors: Jaesin Ahn, Heechul Jung
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a result, extensive experiments on explicit content mitigation and adaptive attack defense show that DES achieves state-of-the-art (SOTA) defense, with attack success rate (ASR) of 9.47% on FLUX.1, a recent popular model, and 0.52% on the widely adopted Stable Diffusion v1.5. These correspond to ASR reductions of 76.5% and 63.9% compared to previous SOTA methods, EraseAnything and AdvUnlearn, respectively. Furthermore, DES maintains benign image quality, achieving Fréchet Inception Distance and CLIP score comparable to those of the original FLUX.1 and Stable Diffusion v1.5. |
| Researcher Affiliation | Academia | Jaesin Ahn, Department of Artificial Intelligence, Kyungpook National University, EMAIL; Heechul Jung, Department of Artificial Intelligence, Kyungpook National University, EMAIL |
| Pseudocode | Yes | We provide a detailed description of the target vector generation phase in Algorithm 1. ... We present the overall DES training process in Algorithm 2. |
| Open Source Code | Yes | We have attached the data and code used in this paper in the supplementary material. |
| Open Datasets | Yes | For explicit prompts, we use the I2P dataset, which may be created intentionally or unintentionally by users without model access. ... For Sections 4.2 and 4.3, we use 6,911 safe-unsafe prompt pairs from the sexual category of CoPro dataset [26] to train the text encoder. ... Image generation quality is assessed using FID, and text-image alignment is evaluated using CLIP score [17], computed on 10k samples from COCO-30k dataset [6]. ... We further validate DES's adversarial defense capabilities using Q16 [38], an alternative NSFW classifier trained on the SMID dataset [8]. |
| Dataset Splits | No | Image generation quality is assessed using FID, and text-image alignment is evaluated using CLIP score [17], computed on 10k samples from COCO-30k dataset [6]. For Sections 4.2 and 4.3, we use 6,911 safe-unsafe prompt pairs from the sexual category of CoPro dataset [26] to train the text encoder. For Section 4.4, we additionally use 8,931 prompt pairs from the violence and illegal categories of to cover NSFW categories such as violence, illegal, hate, and others. We also generate 1,600 prompts related to Van Gogh for experiment in Section 4.4. |
| Hardware Specification | Yes | Our experiments were conducted on an NVIDIA DGX A100 (40GB) 8-GPU server running Ubuntu 22.04.4 LTS. |
| Software Dependencies | Yes | We used CUDA 11.8, PyTorch 2.2.1, torchvision 0.17.1, transformers 4.46.0, diffusers 0.29.0, and faiss 1.7.2. |
| Experiment Setup | Yes | We set [symbol lost in extraction] = 0.3 and [symbol lost in extraction] = 200 to train the text encoders of SDv1.5 and FLUX.1. ... The text encoder was trained for 2 epochs with a learning rate of 1e-5, using the AdamW optimizer and a batch size of 128. |
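
The Software Dependencies row lists exact library versions, so a reader trying to reproduce the setup may want to check a local environment against them. The following is a minimal sketch of such a check, not part of the paper's released code. The distribution names in `REPORTED` are assumptions: the source gives library names rather than pip package names (faiss in particular is commonly published as `faiss-cpu` or `faiss-gpu`), and CUDA 11.8 is omitted because it is not a Python package. The helper names `installed_version` and `compare` are hypothetical.

```python
from importlib import metadata

# Versions as reported in the Software Dependencies row.
# Keys are ASSUMED pip distribution names; adjust them to your environment
# (e.g. "faiss-cpu" instead of "faiss-gpu").
REPORTED = {
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "transformers": "4.46.0",
    "diffusers": "0.29.0",
    "faiss-gpu": "1.7.2",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare(reported, lookup=installed_version):
    """Map each package name to 'match', 'mismatch', or 'missing'."""
    report = {}
    for name, wanted in reported.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        # Drop local build tags such as '+cu118' before comparing.
        elif found.split("+")[0] == wanted:
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


if __name__ == "__main__":
    for name, status in compare(REPORTED).items():
        print(f"{name}: {status}")
```

The `lookup` parameter exists only so the comparison can be exercised without installing the packages, for example by passing `some_dict.get` in place of the real metadata query.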