Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Transstratal Adversarial Attack: Compromising Multi-Layered Defenses in Text-to-Image Models

Authors: Chunlong Xie, Kangjie Chen, Shangwei Guo, Shudong Zhang, Tianwei Zhang, Tao Xiang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated across 14 T2I models (e.g., Stable Diffusion, DALL·E, and Midjourney) and 17 safety modules, our attack achieves an average attack success rate of 85.6%, surpassing state-of-the-art methods by 73.5%. Our findings challenge the isolated design of safety mechanisms and establish the first benchmark for holistic robustness evaluation in multi-layered safeguarded T2I models.
Researcher Affiliation | Collaboration | (1) Chongqing University, Chongqing, China; (2) Nanyang Technological University, Singapore; (3) Huawei Technologies Co., Ltd., Shenzhen, China
Pseudocode | No | The paper describes the methodology in Section 4 with prose and flow diagrams (Figure 2), but does not include a formally structured pseudocode block or algorithm.
Open Source Code | Yes | The code can be found in https://github.com/Bluedask/TAA-T2I.
Open Datasets | Yes | We curate 118 prompts that cannot bypass default safety filters from the nsfw_200 dataset [43], augmented with 72 LLM-generated NSFW prompts (using seed prompts in [42]) that also fail to bypass safety filters. This forms the nsfw_190 dataset, with details in Appendix B.1. For T2I models, we evaluate 14 representative T2I models, with 10 open-sourced ones (SD-v1.4 [32], SD-v1.5 [32], SD-v2.1 [32], SD-XL [27], SDXL-Turbo [34], SD-3 [7], SD3.5 [7], FLUX.1-dev [19], FLUX.1-schnell [19] and Lumina [28]), and 4 commercial T2I services (Dall·E-2 [31], Dall·E-3 [3], midjourney-6.1 [1] and midjourney-7 [1]). ... We built a new dataset called NSFW-4000, containing 1,000 TAA-generated NSFW images, 1,000 benign images from the COCO dataset [22], and 2,000 harmful images from existing datasets [18].
Dataset Splits | No | The paper mentions curating the 'nsfw_190 dataset' and creating the 'NSFW-4000' dataset with specific compositions, but it does not provide explicit training, validation, or test splits for any of these datasets for its experiments in the main text.
Hardware Specification | Yes | Experiments on SD models were conducted on an NVIDIA RTX 3090, and on an NVIDIA A100 for Flux models. Evaluation of results was performed on an A100.
Software Dependencies | No | The implementation was done in Python, and the framework used for the T2I models incorporates the transformer library.
Experiment Setup | Yes | We employ three LLMs (gpt-4o [15], o1-mini [17], and gpt-4.1 [26]) to generate word substitutions. For each word queried to the LLM, the candidate list size is fixed at 10. In candidate probability calculation, f_upper is configured to 0.8, f_lower to 0.2, and the temperature parameter T to 1.0. For genetic optimization, we set the population size to 20, maximum generations to 20, initial mutation rate to 0.5 and minimum mutation rate to 0.1.
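
For readers attempting a rerun, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The sketch below is only an illustration: the names `TAAConfig` and `mutation_rate` are hypothetical and are not taken from the authors' repository, and the linear decay between the initial and minimum mutation rates is an assumption, since the quoted text gives the two endpoints but not the schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TAAConfig:
    """Hypothetical container; field values are those quoted in the Experiment Setup row."""
    llms: tuple = ("gpt-4o", "o1-mini", "gpt-4.1")  # LLMs used for word substitutions
    candidate_list_size: int = 10                   # candidates per queried word
    f_upper: float = 0.8                            # candidate probability calculation
    f_lower: float = 0.2                            # candidate probability calculation
    temperature: float = 1.0                        # temperature parameter T
    population_size: int = 20                       # genetic optimization
    max_generations: int = 20                       # genetic optimization
    initial_mutation_rate: float = 0.5
    min_mutation_rate: float = 0.1


def mutation_rate(cfg: TAAConfig, generation: int) -> float:
    """ASSUMPTION: linear decay from the initial to the minimum rate.

    The source states only the two endpoints; the actual schedule may differ.
    """
    if cfg.max_generations <= 1:
        return cfg.min_mutation_rate
    # Clamp the generation index into [0, max_generations - 1].
    step = min(max(generation, 0), cfg.max_generations - 1)
    frac = step / (cfg.max_generations - 1)
    span = cfg.initial_mutation_rate - cfg.min_mutation_rate
    return cfg.initial_mutation_rate - frac * span


if __name__ == "__main__":
    cfg = TAAConfig()
    for g in (0, 10, cfg.max_generations - 1):
        print(g, round(mutation_rate(cfg, g), 3))
```

Keeping the values in a frozen dataclass makes accidental changes during a run fail loudly, which helps when comparing a rerun against the reported settings.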