Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Perception-Guided Jailbreak Against Text-to-Image Models
Authors: Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ. |
| Researcher Affiliation | Collaboration | 1 Nanyang Technological University, Singapore; 2 East China Normal University, China; 3 Wuhan University, China; 4 Key Laboratory of Cyberspace Security, Ministry of Education, China; 5 Shanghai Trusted Industrial Control Platform Co., Ltd., China |
| Pseudocode | No | The paper describes the steps for "Unsafe word selection" and "Word substitution" using LLM instructions in prose with examples, but does not present a formally labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement offering access to the source code for the proposed PGJ method, nor does it provide a link to a code repository. |
| Open Datasets | No | Thus we exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent. The paper does not provide a link, DOI, or specific repository name for accessing this generated dataset. |
| Dataset Splits | Yes | We select 20 prompts for each NSFW type, a total of 100 prompts. ... Each NSFW type is represented by 200 prompts. |
| Hardware Specification | Yes | All the experiments were run on an Ubuntu system with an NVIDIA A6000 Tensor Core GPU of 48G RAM. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or libraries with their version numbers. |
| Experiment Setup | Yes | Victim T2I Models. We adopt six popular T2I models as the victims of our attack. They are DALL·E 2 (OpenAI 2021), DALL·E 3 (OpenAI 2023a), Cogview3 (Zhipu 2024), SDXL (Podell et al. 2023), Tongyiwanxiang (Ali 2023b), and Hunyuan (Tencent 2024). ... Datasets. ... We exploit GPT4 to generate a dataset with 1,000 prompts for five classical NSFW types: discrimination, illegal, pornographic, privacy, and violent. ... Baselines. ... Evaluation metrics. We use four metrics to evaluate the experiment. ❶ We use the attack success rate (ASR) metric... ❷ We use the semantic consistency (SC) metric... ❸ We use prompt perplexity (PPL) as a metric... ❹ We use the Inception Score (IS)... |