When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Authors: Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs. We also show that RLbreaker is robust against three SOTA defenses and its trained agents can transfer across different LLMs. We further validate the key design choices of RLbreaker via a comprehensive ablation study.
Researcher Affiliation | Academia | Xuan Chen (Purdue University), Yuzhou Nie (University of California, Santa Barbara), Wenbo Guo (University of California, Santa Barbara), Xiangyu Zhang (Purdue University)
Pseudocode | Yes | Algorithm 1 (RLbreaker: Training) and Algorithm 2 (RLbreaker: Testing)
Open Source Code | Yes | Code is available at https://github.com/XuanChen-xc/RLbreaker.
Open Datasets | Yes | Dataset. We select the widely-used AdvBench dataset [76], which contains 520 harmful questions.
Dataset Splits | No | We randomly split it into a 40%/60% training/testing set.
Hardware Specification | Yes | We run the experiments using a single NVIDIA A100 GPU with 80GB memory. For experiments of AutoDAN and GCG and all experiments on Vicuna-7b and Vicuna-13b, we use 3 NVIDIA A100 GPUs with 80GB memory and 1 NVIDIA RTX A6000.
Software Dependencies | No | The paper mentions various software components and models (e.g., GPT-2, GPT-3.5-turbo, Vicuna-7b, XLM-RoBERTa, vLLM), but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Specifically, we first design an LLM-facilitated action space that leverages a helper LLM to mutate the current jailbreaking prompt. Our action design enables diverse action variations while constraining the overall policy learning space. We also design a customized reward function that can decide whether the target LLM's response actually answers the input harmful question at each time step. Our reward function provides dense and meaningful rewards that facilitate policy training. Finally, we also customize the widely used PPO algorithm [47] to further reduce the training randomness. The first condition is when the maximum time step T = 5 is reached, and the second is when the agent's reward at time step t, denoted as r(t), is higher than a threshold τ = 0.7. For GCG... we use their standard settings with 1000 iterations and 8 batch size to train the attack. For Llama2-70b-chat and Mixtral-8x7B-Instruct... we limit the iterations to 500... and set batch size to 4.
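
For context on the Dataset Splits row above, the following is a minimal sketch of the 40%/60% training/testing split the quote describes, assuming AdvBench is available as a CSV of 520 harmful questions. The file name, column name, and random seed are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the 40%/60% AdvBench split described in the paper.
# The file name "harmful_behaviors.csv", the "goal" column, and the seed
# are assumptions for illustration only; the paper does not report a seed.
import random
import pandas as pd

questions = pd.read_csv("harmful_behaviors.csv")["goal"].tolist()  # 520 harmful questions
random.seed(0)
random.shuffle(questions)

n_train = int(0.4 * len(questions))   # 40% training -> 208 questions
train_questions = questions[:n_train]
test_questions = questions[n_train:]  # 60% testing -> 312 questions
```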
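
The Experiment Setup row describes a per-question episode in which a DRL agent picks a mutation, a helper LLM rewrites the jailbreaking prompt, the target LLM is queried, and the episode terminates either when the maximum time step T = 5 is reached or when the reward r(t) exceeds τ = 0.7. Below is a minimal sketch of that loop under those assumptions; the mutation names, the stand-in functions (`mutate_prompt`, `query_target_llm`, `compute_reward`, `select_action`), and the prompt/question concatenation are simplified placeholders, not the paper's implementation, and the PPO policy update is omitted.

```python
# Minimal sketch of one RLbreaker-style jailbreaking episode.
# All components below are simplified stand-ins for parts described in the
# paper (helper LLM mutator, target LLM, reward model, PPO policy),
# not the authors' actual code.
import random

MAX_STEPS = 5            # maximum time step T reported in the paper
REWARD_THRESHOLD = 0.7   # reward threshold tau reported in the paper
MUTATIONS = ["rephrase", "expand", "shorten", "crossover", "generate_similar"]  # illustrative names

def mutate_prompt(prompt: str, action: int) -> str:
    """Stand-in for the helper LLM applying the chosen mutation to the prompt."""
    return f"[{MUTATIONS[action]}] {prompt}"

def query_target_llm(prompt: str) -> str:
    """Stand-in for a call to the target LLM."""
    return "I cannot help with that."

def compute_reward(question: str, response: str) -> float:
    """Stand-in for the reward model deciding whether the response answers the question."""
    return 0.0 if "cannot" in response.lower() else 1.0

def select_action(state: str) -> int:
    """Stand-in for the PPO policy; a real agent would condition on the state."""
    return random.randrange(len(MUTATIONS))

def run_episode(question: str, init_prompt: str):
    """Mutate the jailbreaking prompt until the reward exceeds tau or T steps elapse."""
    prompt, trajectory = init_prompt, []
    for t in range(MAX_STEPS):                    # first termination condition: t = T
        action = select_action(prompt)
        prompt = mutate_prompt(prompt, action)
        response = query_target_llm(f"{prompt}\n{question}")
        reward = compute_reward(question, response)
        trajectory.append((prompt, action, reward))
        if reward > REWARD_THRESHOLD:             # second termination condition: r(t) > tau
            break
    return trajectory
```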