When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Authors: Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that RLbreaker is much more effective than existing jailbreaking attacks against six state-of-the-art (SOTA) LLMs. We also show that RLbreaker is robust against three SOTA defenses and its trained agents can transfer across different LLMs. We further validate the key design choices of RLbreaker via a comprehensive ablation study. |
| Researcher Affiliation | Academia | Xuan Chen¹, Yuzhou Nie², Wenbo Guo², Xiangyu Zhang¹ — ¹Purdue University, ²University of California, Santa Barbara |
| Pseudocode | Yes | Algorithm 1 RLbreaker: Training and Algorithm 2 RLbreaker: Testing |
| Open Source Code | Yes | Code is available at https://github.com/XuanChen-xc/RLbreaker. |
| Open Datasets | Yes | Dataset. We select the widely-used AdvBench dataset [76], which contains 520 harmful questions. |
| Dataset Splits | No | We randomly split it into a 40%/60% training/testing set. |
| Hardware Specification | Yes | We run the experiments using a single NVIDIA A100 GPU with 80GB memory. For experiments of AutoDAN and GCG and all experiments on Vicuna-7b and Vicuna-13b, we use 3 NVIDIA A100 GPUs with 80GB memory and 1 NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions various software components and models (e.g., GPT-2, GPT-3.5-turbo, Vicuna-7b, XLM-RoBERTa, vLLM), but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Specifically, we first design an LLM-facilitated action space that leverages a helper LLM to mutate the current jailbreaking prompt. Our action design enables diverse action variations while constraining the overall policy learning space. We also design a customized reward function that can decide whether the target LLM's response actually answers the input harmful question at each time step. Our reward function provides dense and meaningful rewards that facilitate policy training. Finally, we also customize the widely used PPO algorithm [47] to further reduce the training randomness. The first condition is when the maximum time step T = 5 is reached, and the second is when the agent's reward at time step t, denoted as r(t), is higher than a threshold τ = 0.7. For GCG... we use their standard settings with 1000 iterations and a batch size of 8 to train the attack. For Llama2-70b-chat and Mixtral-8x7B-Instruct... we limit the iterations to 500... and set the batch size to 4. |
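
To make the loop described in the Experiment Setup row concrete, below is a minimal Python sketch of one episode under the stated settings (maximum time step T = 5, reward threshold τ = 0.7). The callables `select_action`, `mutate_prompt`, `query_target_llm`, and `reward_fn` are hypothetical placeholders standing in for the trained policy, the helper-LLM mutator, the target LLM, and the paper's customized reward function; this illustrates the control flow only and is not the authors' implementation (see Algorithms 1 and 2 in the paper for the actual procedures).

```python
# Minimal sketch of one DRL-guided jailbreaking episode, assuming the
# settings quoted above (T = 5, tau = 0.7). All callables passed in are
# hypothetical placeholders, not the authors' API.

MAX_STEPS = 5      # maximum time step T
REWARD_TAU = 0.7   # reward threshold tau

def jailbreak_episode(question, init_prompt, select_action, mutate_prompt,
                      query_target_llm, reward_fn):
    """Run one episode: mutate the jailbreaking prompt until the reward
    exceeds the threshold or the step limit is reached."""
    prompt = init_prompt
    response, reward = None, 0.0
    for t in range(MAX_STEPS):                      # termination condition 1: t reaches T
        # Policy selects a mutation from the LLM-facilitated action space.
        action = select_action(prompt)
        # A helper LLM applies the chosen mutation to the current prompt.
        prompt = mutate_prompt(prompt, action)
        # Query the target LLM with the mutated prompt wrapping the harmful question.
        response = query_target_llm(prompt, question)
        # Custom reward: does the response actually answer the harmful question?
        reward = reward_fn(question, response)
        if reward > REWARD_TAU:                     # termination condition 2: r(t) > tau
            break
    return prompt, response, reward
```

During training, the transitions collected by this loop would feed the paper's customized PPO update; at test time, the episode simply stops at whichever termination condition is hit first.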