Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Authors: Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. |
| Researcher Affiliation | Academia | 1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2 City University of Hong Kong; 3 Singapore Management University. EMAIL EMAIL; {xiang.zheng}@cityu.edu.hk; {xdliyige}@gmail.com. |
| Pseudocode | Yes | Algorithm 1 Fine-Tuning the Blue-Team Suffix Generator |
| Open Source Code | Yes | Code is available at https://github.com/Vinsonzyh/BlueSuffix. |
| Open Datasets | Yes | We run our experiments on four popular safety benchmarks: AdvBench (Zou et al., 2023), MM-SafetyBench (Liu et al., 2024c), RedTeam-2K (Luo et al., 2024) and Harmful Instructions (Qi et al., 2024). Detailed introductions of the safety benchmarks are provided in the Appendix E. |
| Dataset Splits | No | The blue suffix generator is fine-tuned from a pre-trained GPT-2 using Proximal Policy Optimization (PPO) (Schulman et al., 2017) on hard jailbreak prompts crafted by the BAP attack (Ying et al., 2024) on all 13 jailbreak topics from the MM-SafetyBench. Please note that fine-tuned GPT-2 will be applied to defend other attacks (ImgJP, VAA, GCG, and AutoDAN) and other datasets (RedTeam-2K, AdvBench, and Harmful Instructions) to test its generalizability. The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o, the judge template is provided in the Appendix D) is 1 if the model's response is benign, 0 otherwise. The fine-tuning can be stopped until the expected safety score exceeds 0.95, for about 300 epochs. |
| Hardware Specification | No | The computations in this research were performed using the CFFF platform of Fudan University. |
| Software Dependencies | No | We fine-tune a GPT-2 model (Radford et al., 2019) for the suffix generator. ... We utilize GPT-4o (Achiam et al., 2023) to achieve the above objective with a rewritten template. As GPT-4o is a commercial model, we also test the open-source model Llama-3-8B-Instruct (AI@Meta, 2024) as the text purifier. |
| Experiment Setup | Yes | The fine-tuning batch size is set to 32. The reward given by the LLM judge (i.e., GPT-4o, the judge template is provided in the Appendix D) is 1 if the model's response is benign, 0 otherwise. The fine-tuning can be stopped until the expected safety score exceeds 0.95, for about 300 epochs. |
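
The reward and stopping rule quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code: the PPO update, the suffix generator, the target VLM, and the GPT-4o judge are all omitted, and `judge_epoch` is a hypothetical stand-in that returns one benign/harmful verdict per prompt in a batch. The function names are assumptions too, and reading the "expected safety score" as the mean binary reward over a batch is one plausible interpretation. Only the batch size of 32, the 0/1 reward, the 0.95 threshold, and the roughly 300-epoch budget come from the quoted text.

```python
from typing import Callable, Iterable, List

BATCH_SIZE = 32          # batch size stated in the quoted setup
SAFETY_THRESHOLD = 0.95  # stopping threshold stated in the quoted setup
MAX_EPOCHS = 300         # "about 300 epochs" in the quoted setup


def binary_reward(is_benign: bool) -> float:
    """Return 1.0 if the judge labels the response benign, 0.0 otherwise."""
    return 1.0 if is_benign else 0.0


def expected_safety_score(verdicts: Iterable[bool]) -> float:
    """Mean binary reward over a batch of judge verdicts (assumed definition)."""
    rewards = [binary_reward(v) for v in verdicts]
    if not rewards:
        raise ValueError("need at least one verdict")
    return sum(rewards) / len(rewards)


def run_until_safe(
    judge_epoch: Callable[[int], List[bool]],
    max_epochs: int = MAX_EPOCHS,
    threshold: float = SAFETY_THRESHOLD,
) -> int:
    """Return the first epoch whose safety score exceeds the threshold.

    judge_epoch is a hypothetical callback standing in for one epoch of
    fine-tuning followed by judging; if the threshold is never exceeded,
    max_epochs is returned.
    """
    for epoch in range(1, max_epochs + 1):
        if expected_safety_score(judge_epoch(epoch)) > threshold:
            return epoch
    return max_epochs


def fake_judge(epoch: int) -> List[bool]:
    """Toy verdicts: one more benign response per epoch, capped at BATCH_SIZE."""
    benign = min(BATCH_SIZE, 28 + epoch)
    return [True] * benign + [False] * (BATCH_SIZE - benign)
```

With the toy `fake_judge`, the per-epoch scores are 29/32, 30/32, and 31/32, so `run_until_safe(fake_judge)` stops at epoch 3, the first epoch where the score exceeds 0.95.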