Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

Authors: Linbao Li, Yannan Liu, Daojing He, Yu Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
Researcher Affiliation | Collaboration | 1 Harbin Institute of Technology, Shenzhen; 2 Wuheng Lab, ByteDance; 3 Zhejiang University
Pseudocode | No | The paper includes figures illustrating the method's overview and components, but no explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | We make the codebase available at https://github.com/LLBao/ArrAttack.
Open Datasets | Yes | Our experiments use three datasets: AdvBench, introduced by Zou et al. (2023); HarmBench, introduced by Mazeika et al. (2024); and JBB-Behaviors, introduced by Chao et al. (2024).
Dataset Splits | Yes | The filtered dataset is then divided into three subsets. The first subset, containing 150 instances, is used in Section 3.3. The second subset, containing 579 instances, is used in Section 3.4. The final subset, containing 196 instances, is used for the comparison of our experimental results. We ensure that the first subset does not overlap with the second, and the second subset does not overlap with the third.
Hardware Specification | Yes | Specifically, our setup requires only a single 80GB A800 GPU and approximately five GPU hours, making it a feasible approach.
Software Dependencies | No | The paper mentions specific models (Llama2-7b, RoBERTa, GPT-4, T5-base, all-mpnet-base-v2) and tools (GPTFuzz), but it does not provide explicit version numbers for general software libraries or frameworks (e.g., Python, PyTorch, CUDA) as required.
Experiment Setup | Yes | Hyperparameters: For ArrAttack, we define each attack attempt as the process of generating a single jailbreak prompt. We establish the maximum number of attack attempts as 50 for Guanaco-7b and Vicuna-7b, while for Llama2-7b-chat, we set it to 200. During each attack attempt, the generation model produces a new prompt that is evaluated for its success in bypassing the target model's defenses. If the prompt successfully induces the model to output a harmful response, the attack is considered successful. Otherwise, the process iterates, generating new variations of the prompt until either a successful jailbreak occurs or the maximum number of attempts is reached. The decoding strategy for the generation model uses joint decoding, with top-p set to 0.9 and temperature set to 0.8. Unless explicitly stated otherwise, these configurations will be maintained in subsequent experiments. ... Table 6 (hyperparameters for the robustness judgment model and the prompt generation model): learning rate 2e-5; weight decay 1e-4; num train epochs 8; per-device train batch size 6; gradient accumulation steps 2; gradient checkpointing True; optim paged_adamw_32bit; bf16 True; tf32 True; max grad norm 0.3; warmup ratio 0.03.
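The attack-attempt loop quoted in the Experiment Setup row (generate a candidate prompt, judge whether it jailbreaks the target, retry up to a per-model budget) can be sketched as below. This is a minimal illustration, not code from the ArrAttack repository: the function names are hypothetical, and the toy generator and judge stand in for the real generation model (sampled with top-p 0.9, temperature 0.8) and target model.

```python
import random

def run_attack(generate_prompt, is_jailbroken, max_attempts):
    """Repeat prompt generation until the judge flags a jailbreak
    or the attempt budget is exhausted. Returns (success, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        prompt = generate_prompt()
        if is_jailbroken(prompt):
            return True, attempt
    return False, max_attempts

# Per-target attempt budgets reported in the paper.
MAX_ATTEMPTS = {"guanaco-7b": 50, "vicuna-7b": 50, "llama2-7b-chat": 200}

# Toy stand-ins: a real run would sample the generation model for each
# candidate and query the target LLM, then score the response for harm.
rng = random.Random(0)
success, used = run_attack(
    generate_prompt=lambda: f"candidate-{rng.random():.3f}",
    is_jailbroken=lambda _prompt: rng.random() < 0.1,  # dummy 10% judge
    max_attempts=MAX_ATTEMPTS["vicuna-7b"],
)
```

The loop terminates either on the first judged success or after the budget is spent, matching the procedure described for Guanaco-7b/Vicuna-7b (50 attempts) and Llama2-7b-chat (200 attempts).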