Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models

Authors: Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-chul Moon

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts.
Researcher Affiliation | Collaboration | (1) KAIST, (2) NAVER AI Lab, (3) SNU AIIS, (4) summary.ai; EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 Diffusion Sampling with STG
Open Source Code | Yes | Our code is available at https://github.com/aailab-kaist/STG.
Open Datasets | Yes | We evaluate our method on nudity and violence using both black-box and white-box red-teaming protocols, following [28]. For black-box attacks, we use Ring-A-Bell [43] (95 nudity and 250 violence prompts) and Sneaky Prompt [46] (200 nudity prompts). General generation quality is assessed using zero-shot FID [17] and CLIP score on 3,000 images generated from randomly sampled captions in the COCO validation set, capturing overall image fidelity and text-image alignment. COCO: https://cocodataset.org/#termsofuse; Ring-A-Bell: https://github.com/chiayi-hsu/Ring-A-Bell/blob/main/LICENSE; Sneaky Prompt: https://github.com/Yuchen413/text2image_safety/blob/main/LICENSE; I2P: https://huggingface.co/datasets/AIML-TUDA/i2p
Dataset Splits | Yes | For black-box attacks, we use Ring-A-Bell [43] (95 nudity and 250 violence prompts) and Sneaky Prompt [46] (200 nudity prompts). General generation quality is assessed using zero-shot FID [17] and CLIP score on 3,000 images generated from randomly sampled captions in the COCO validation set, capturing overall image fidelity and text-image alignment. For the violence task, we adopt Concept Inversion [29], where a special token <c> is learned via textual inversion to bypass safety mechanisms. Following the DUO protocol [28], we use 304 prompts with a Q16 percentage of 0.95 or higher from the I2P benchmark [35] in order to generate harmful images.
Hardware Specification | Yes | Most experiments are conducted on a single NVIDIA A100 GPU with CUDA 11.4.
Software Dependencies | Yes | Most experiments are conducted on a single NVIDIA A100 GPU with CUDA 11.4. We fix the sampling process using a DDIM sampler [38] with 50 sampling steps and a classifier-free guidance scale of 7.5. When using the DDPM sampler [18], we keep all other settings identical to those of the DDIM sampler. For PixArt-α [6], we use a Transformer-based architecture with Flan-T5-XXL [7] as the text encoder. Sampling follows the default configuration for this model: a DPM-Solver [23] with 20 steps and a classifier-free guidance scale of 4.5. We compute LPIPS using the implementation provided in the RECE codebase, which is based on the lpips library (version 0.1 with AlexNet).
Experiment Setup | Yes | Sampling is performed with a DDIM sampler [38] with 50 steps and a classifier-free guidance scale of 7.5. To control the strength of the safety guidance, we adjust the update scale hyperparameter ρ. Additionally, we introduce two hyperparameters, the update threshold τ and the update step ratio γ, to reduce computational cost. In the nudity black-box attack experiment, corresponding to Figures 3a and 3b, we explore the trade-off between PP and DSR by fixing the update step ratio to γ = 0.8 and varying the hyperparameters (ρ, τ) as follows: {(1.8, 0.01), (1.3, 0.01), (0.5, 0.01), (0.5, 0.03), (0.5, 0.2)}, plotted from left to right. For the COCO evaluation in Table 2, we use the midpoint hyperparameter setting of (ρ, τ) = (0.5, 0.01). In the violence black-box attack experiment, corresponding to Figure 3c, we similarly evaluate the trade-off between PP and DSR by fixing τ = 0.05 and γ = 0.6, while varying ρ over the following values: {3, 2, 1, 0.5, 0.2, 0.1} in left-to-right order.
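
For readers who want to re-run the sweeps quoted under Experiment Setup, the reported values can be written down as plain data. The sketch below is only an illustration: the names STGSetting, SAMPLER, NUDITY_SWEEP, COCO_SETTING, and VIOLENCE_SWEEP are hypothetical and do not come from the STG repository, and the value γ = 0.8 for the COCO setting is an assumption, since the quoted text gives only (ρ, τ) = (0.5, 0.01) and calls it the midpoint of the nudity sweep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class STGSetting:
    """One hyperparameter setting as named in the quoted text (hypothetical container)."""
    rho: float    # update scale: strength of the safety guidance
    tau: float    # update threshold
    gamma: float  # update step ratio


# Sampler configuration quoted for the main experiments.
SAMPLER = {"sampler": "DDIM", "num_steps": 50, "cfg_scale": 7.5}

# Nudity black-box sweep (Figures 3a and 3b): gamma fixed to 0.8,
# (rho, tau) pairs listed in the reported left-to-right plotting order.
NUDITY_SWEEP = [
    STGSetting(rho=rho, tau=tau, gamma=0.8)
    for rho, tau in [(1.8, 0.01), (1.3, 0.01), (0.5, 0.01), (0.5, 0.03), (0.5, 0.2)]
]

# COCO evaluation (Table 2): the midpoint (rho, tau) = (0.5, 0.01).
# Assumption: gamma is inherited from the nudity sweep; the source does not state it.
COCO_SETTING = NUDITY_SWEEP[len(NUDITY_SWEEP) // 2]

# Violence black-box sweep (Figure 3c): tau = 0.05 and gamma = 0.6 fixed,
# rho varied in the reported left-to-right order.
VIOLENCE_SWEEP = [
    STGSetting(rho=rho, tau=0.05, gamma=0.6)
    for rho in [3, 2, 1, 0.5, 0.2, 0.1]
]

if __name__ == "__main__":
    for name, sweep in [("nudity", NUDITY_SWEEP), ("violence", VIOLENCE_SWEEP)]:
        for setting in sweep:
            print(name, setting)
    print("coco", COCO_SETTING)
```

Keeping the settings in an immutable dataclass makes each plotted point in Figure 3 traceable to a single record; how these values are passed to the released code is not described in the quoted text and would need to be checked against the repository.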