Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

Authors: Yichuan Cao, Yibo Miao, Xiao-Shan Gao, Yinpeng Dong

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach. Our codes are available at: https://github.com/caosip/RPG-RT.
Researcher Affiliation | Academia | (1) State Key Laboratory of Math. Sci., Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; (2) University of Chinese Academy of Sciences, Beijing 100049, China; (3) College of AI, Tsinghua University, Beijing 100084, China; (4) Shanghai Qi Zhi Institute
Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 1 for an overview of the RPG-RT framework) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our codes are available at: https://github.com/caosip/RPG-RT.
Open Datasets | Yes | We consider five NSFW categories. For nudity, we select the I2P dataset [58], and choose 95 prompts with nudity above 50%. We also consider the NSFW categories including violence, politicians, discrimination, and copyrights. Details of these datasets are provided in Appendix C.1.
Dataset Splits | No | While the paper mentions using 95 prompts for nudity and selecting 94 prompts from I2P for transferability experiments, it does not explicitly provide the exact training/test/validation dataset splits (e.g., percentages, sample counts, or specific split methodology with random seeds) for the DPO training of the LLM agent.
Hardware Specification | Yes | All of the experiments are conducted on Intel(R) Xeon(R) Gold 6430 CPUs and A800 GPUs.
Software Dependencies | No | The paper mentions using 'Vicuna-7B model [8]' as the base LLM, 'Adam optimizer [26]', 'LoRA [21]', and 'DPO [53]'. However, it does not provide specific version numbers for software libraries, such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the prompt modification, we choose a high temperature parameter of 1.0 during model sampling, set top-p to 0.6, and apply a repetition penalty of 1.0 to encourage the model to produce more varied and meaningful modified prompts. Additionally, for each original prompt, we perform 30 modifications for each original prompt to ensure sufficient data for preference modeling and fine-tuning. For the scoring model, we select the transformation f as a single-layer linear transformation. To scale the NSFW scores within the range [0, 1], we apply the Sigmoid activation only to the first dimension of the output from the linear layer. During the training of the scoring model, we set the batch size to 16, the learning rate to 1e-4, and use the Adam optimizer [26] for 3000 iterations. For the preference modeling, we set the parameter c to 2 to achieve a balanced trade-off between ASR and semantic preservation, as we show in Appendix C.4. For the LLM agent, we select the unaligned Vicuna-7B model [8] as the base model, as safety-aligned LLMs may reject prompt modifications that generate NSFW semantics. When fine-tuning the LLM agent using direct preference optimization [53] (DPO), we employ LoRA [21] with a rank of 64 and a dropout rate of 0.05, performing one epoch of fine-tuning on all preference data, and use the Adam [26] optimizer with a learning rate of 2e-4. As a default setting, we perform a 10-round cycle of query feedback and LLM fine-tuning.
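
The scoring-model part of the Experiment Setup entry (a single linear layer whose first output dimension is passed through a Sigmoid so that it lies in [0, 1]) can be illustrated with a minimal plain-Python sketch. This is not the authors' code: the names `ScoringModelConfig`, `sigmoid`, and `scoring_head`, the feature and weight values, and the two-dimensional output are all hypothetical, and this summary does not say what the input features or the remaining output dimensions represent. Only the hyperparameter values in the config are taken from the quoted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringModelConfig:
    """Training values quoted in the Experiment Setup entry (field names are hypothetical)."""
    batch_size: int = 16
    learning_rate: float = 1e-4
    iterations: int = 3000
    optimizer: str = "adam"


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def scoring_head(features, weight, bias):
    """Apply one linear layer, then a sigmoid to output dimension 0 only.

    features: list of floats (assumed input embedding)
    weight:   list of rows, one row per output dimension
    bias:     list of floats, one per output dimension
    """
    out = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    out[0] = sigmoid(out[0])  # dimension 0 is squashed into [0, 1]
    return out                # remaining dimensions stay unbounded


if __name__ == "__main__":
    cfg = ScoringModelConfig()
    scores = scoring_head([1.0, 2.0], [[0.0, 0.0], [1.0, 1.0]], [0.0, 0.5])
    print(cfg, scores)
```

In a real training loop the weights would be learned with the quoted optimizer settings; here they are fixed by hand so that the effect of applying the activation to a single dimension is easy to see.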