Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Authors: Andy Zhou, Bo Li, Haohan Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art.
Researcher Affiliation | Collaboration | Andy Zhou (1,2), Bo Li (1), Haohan Wang (1); 1: University of Illinois Urbana-Champaign, 2: Lapis Labs
Pseudocode | Yes | Algorithm 1: Robust Prompt Optimization
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Code is provided as supplementary material
Open Datasets | Yes | We optimize and evaluate our method on the instructions, attack baselines, and defense baselines from two recently proposed red-teaming benchmarks, HarmBench [Mazeika et al., 2024] and JailbreakBench [Chao et al., 2024]. For text-based LLMs, HarmBench and JailbreakBench contain 400 and 100 distinct harmful behaviors, respectively. ... We optimize the suffix using 25 randomly selected instructions from the training set of AdvBench [Zou et al., 2023], to minimize overlap with evaluation instructions.
Dataset Splits | No | No explicit training/validation/test split percentages or sample counts for validation are mentioned.
Hardware Specification | Yes | This work used NVIDIA GPUs at NCSA Delta through allocations CIS230218 and CIS230365 from the ACCESS program and from the Illinois Compute Campus Cluster. ... We optimize the RPO suffix using a batch size of 64 and 500 optimization steps, on a single 80GB NVIDIA A100.
Software Dependencies | No | The paper mentions using Mixtral [Jiang et al., 2024] as an attacker LLM but does not specify version numbers for general software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We use the test cases and jailbreak prompts provided in each benchmark. For GCG, the adversarial suffix is optimized for each individual instruction, with a batch size of 512 and 500 optimization steps. ... For PAIR, the attacker model is Mixtral [Jiang et al., 2024] with a temperature of one, top-p sampling with p = 0.9, N = 30 streams, and a maximum depth of K = 3. ... We optimize the RPO suffix using a batch size of 64 and 500 optimization steps, on a single 80GB NVIDIA A100. We use a selection interval of 50, top-k of 256, and 25 randomly selected instructions from the training set of AdvBench. The target model is Llama-2-7B-chat.
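
The Experiment Setup row, together with the referenced Algorithm 1 (Robust Prompt Optimization), is enough to sketch the shape of the defensive-suffix optimization loop. The following is a minimal, self-contained Python sketch, not the authors' implementation: the loss function, instruction list, jailbreak set, and gradient-guided candidate selection are all placeholders; the suffix length and random seed are assumptions not stated in the excerpt; and the "selection interval of 50" is read here as the period at which the worst-case jailbreak is re-selected, which is an interpretation rather than a quote.

```python
import random

random.seed(0)  # seed is an assumption; the paper only says "randomly selected"

# Hyperparameters quoted from the Experiment Setup row.
NUM_STEPS = 500          # optimization steps
BATCH_SIZE = 64          # candidate suffixes evaluated per step
TOP_K = 256              # top-k token substitutions considered per suffix position
SELECTION_INTERVAL = 50  # read as: steps between re-selecting the worst-case jailbreak
SUFFIX_LEN = 20          # assumption: suffix length is not stated in this excerpt
VOCAB_SIZE = 32000       # Llama-2 tokenizer vocabulary size

# Placeholders standing in for the 25 AdvBench training instructions and the
# jailbreaks seen during optimization.
instructions = [f"instruction_{i}" for i in range(25)]
jailbreaks = [f"jailbreak_{j}" for j in range(4)]


def safe_response_loss(instruction, jailbreak, suffix_tokens):
    """Stand-in for the negative log-likelihood of a safe (refusal) response
    under the target model (Llama-2-7B-chat in the paper)."""
    return sum(t % 97 for t in suffix_tokens) / len(suffix_tokens) + hash((instruction, jailbreak)) % 7


def top_k_substitutions(suffix_tokens, k=TOP_K):
    """Stand-in for gradient-guided candidate selection: RPO/GCG rank token
    substitutions by the loss gradient w.r.t. one-hot token indicators; here
    we simply sample k random vocabulary ids per position."""
    return [[random.randrange(VOCAB_SIZE) for _ in range(k)] for _ in suffix_tokens]


suffix = [random.randrange(VOCAB_SIZE) for _ in range(SUFFIX_LEN)]
worst_jailbreak = jailbreaks[0]

for step in range(NUM_STEPS):
    # Minimax selection step: periodically pick the jailbreak that currently
    # defeats the suffix most strongly (highest summed loss over instructions).
    if step % SELECTION_INTERVAL == 0:
        worst_jailbreak = max(
            jailbreaks,
            key=lambda jb: sum(safe_response_loss(ins, jb, suffix) for ins in instructions),
        )

    candidates = top_k_substitutions(suffix)

    # Greedy coordinate descent: try BATCH_SIZE single-token substitutions and
    # keep the one with the lowest loss on the current worst-case jailbreak.
    best_suffix, best_loss = suffix, float("inf")
    for _ in range(BATCH_SIZE):
        pos = random.randrange(SUFFIX_LEN)
        cand = list(suffix)
        cand[pos] = random.choice(candidates[pos])
        loss = sum(safe_response_loss(ins, worst_jailbreak, cand) for ins in instructions)
        if loss < best_loss:
            best_suffix, best_loss = cand, loss
    suffix = best_suffix
```

In the actual method, safe_response_loss would be the model's loss on a safe target completion computed with Llama-2-7B-chat, and top_k_substitutions would rank candidate tokens by the gradient of that loss, as in GCG; the sketch only reproduces the loop structure and the reported hyperparameters.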
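
The PAIR attack baseline configured in the same row (Mixtral attacker, temperature 1, top-p 0.9, N = 30 streams, maximum depth K = 3) can likewise be sketched as nested loops over streams and refinement rounds. The attacker, target, and judge below are stand-in functions rather than real model calls, and the 1-10 judge scale with a score-of-10 success criterion is an assumption about the judge, not a detail quoted from this excerpt.

```python
import random

random.seed(0)  # seed is an assumption, used only to make the demo repeatable

N_STREAMS = 30  # N = 30 parallel conversation streams
MAX_DEPTH = 3   # K = 3 refinement rounds per stream
# The paper queries Mixtral with temperature = 1.0 and top-p = 0.9; the
# attacker stand-in below ignores those sampling settings.


def attacker_generate(goal, history):
    """Stand-in for the Mixtral attacker producing a refined jailbreak prompt
    from the goal and prior (prompt, response, score) feedback."""
    return f"jailbreak attempt {len(history) + 1} for: {goal} ({random.random():.3f})"


def target_respond(prompt):
    """Stand-in for the (possibly RPO-defended) target model's response."""
    return f"response to [{prompt}]"


def judge_score(goal, response):
    """Stand-in for the judge rating how jailbroken the response is."""
    return random.randint(1, 10)


def pair_attack(goal):
    best = None
    for _ in range(N_STREAMS):
        history = []
        for _ in range(MAX_DEPTH):
            prompt = attacker_generate(goal, history)
            response = target_respond(prompt)
            score = judge_score(goal, response)
            history.append((prompt, response, score))
            if best is None or score > best[1]:
                best = (prompt, score)
            if score == 10:  # early exit once the judge deems the attack successful
                return best
    return best


print(pair_attack("example harmful behavior"))
```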