Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling
Authors: Yichuan Cao, Yibo Miao, Xiao-Shan Gao, Yinpeng Dong
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach. Our codes are available at: https://github.com/caosip/RPG-RT. |
| Researcher Affiliation | Academia | (1) State Key Laboratory of Math. Sci., Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; (2) University of Chinese Academy of Sciences, Beijing 100049, China; (3) College of AI, Tsinghua University, Beijing 100084, China; (4) Shanghai Qi Zhi Institute |
| Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 1 for an overview of the RPG-RT framework) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are available at: https://github.com/caosip/RPG-RT. |
| Open Datasets | Yes | We consider five NSFW categories. For nudity, we select the I2P dataset [58], and choose 95 prompts with nudity above 50%. We also consider the NSFW categories including violence, politicians, discrimination, and copyrights. Details of these datasets are provided in Appendix C.1. |
| Dataset Splits | No | While the paper mentions using 95 prompts for nudity and selecting 94 prompts from I2P for transferability experiments, it does not explicitly provide the exact training/test/validation dataset splits (e.g., percentages, sample counts, or specific split methodology with random seeds) for the DPO training of the LLM agent. |
| Hardware Specification | Yes | All of the experiments are conducted on Intel(R) Xeon(R) Gold 6430 CPUs and A800 GPUs. |
| Software Dependencies | No | The paper mentions using 'Vicuna-7B model [8]' as the base LLM, 'Adam optimizer [26]', 'LoRA [21]', and 'DPO [53]'. However, it does not provide specific version numbers for software libraries, such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For the prompt modification, we choose a high temperature parameter of 1.0 during model sampling, set top-p to 0.6, and apply a repetition penalty of 1.0 to encourage the model to produce more varied and meaningful modified prompts. Additionally, we perform 30 modifications for each original prompt to ensure sufficient data for preference modeling and fine-tuning. For the scoring model, we select the transformation f as a single-layer linear transformation. To scale the NSFW scores within the range [0, 1], we apply the Sigmoid activation only to the first dimension of the output from the linear layer. During the training of the scoring model, we set the batch size to 16, the learning rate to 1e-4, and use the Adam optimizer [26] for 3000 iterations. For the preference modeling, we set the parameter c to 2 to achieve a balanced trade-off between ASR and semantic preservation, as we show in Appendix C.4. For the LLM agent, we select the unaligned Vicuna-7B model [8] as the base model, as safety-aligned LLMs may reject prompt modifications that generate NSFW semantics. When fine-tuning the LLM agent using direct preference optimization [53] (DPO), we employ LoRA [21] with a rank of 64 and a dropout rate of 0.05, performing one epoch of fine-tuning on all preference data, and use the Adam [26] optimizer with a learning rate of 2e-4. As a default setting, we perform a 10-round cycle of query feedback and LLM fine-tuning. |
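
The hyperparameters quoted in the Experiment Setup row can be collected in one place for anyone attempting a reproduction. The sketch below is illustrative only and is not taken from the RPG-RT repository: the class name `ReportedSetup`, its field names, and the helper `score_head` are hypothetical, and only the numeric values come from the quoted text. `score_head` shows one plausible reading of "a single-layer linear transformation" with the Sigmoid applied only to the first output dimension, written in plain Python rather than the authors' framework.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    # Prompt modification (LLM sampling)
    temperature: float = 1.0
    top_p: float = 0.6
    repetition_penalty: float = 1.0
    modifications_per_prompt: int = 30
    # Scoring model training
    scorer_batch_size: int = 16
    scorer_learning_rate: float = 1e-4
    scorer_iterations: int = 3000
    # Preference modeling
    preference_c: float = 2.0
    # DPO fine-tuning with LoRA
    lora_rank: int = 64
    lora_dropout: float = 0.05
    dpo_epochs: int = 1
    dpo_learning_rate: float = 2e-4
    # Outer loop of query feedback and fine-tuning
    feedback_rounds: int = 10


def score_head(features, weights, bias):
    """Apply one linear layer, then a sigmoid to the first output only.

    `weights` is a list of rows (one per output dimension) and `bias` has
    one entry per row. This is an assumed reading of the quoted text.
    """
    out = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    # Squash only the first dimension (the NSFW score) into [0, 1].
    out[0] = 1.0 / (1.0 + math.exp(-out[0]))
    return out


setup = ReportedSetup()
example = score_head([1.0, 2.0], [[0.0, 0.0], [1.0, 1.0]], [0.0, 0.5])
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; software versions are deliberately absent because the paper does not report them.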