COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Authors: Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack. In our numerical study, we observe: i) COLD-Attack can efficiently generate fluent (suffix) attacks with the continuation constraint and outperform existing methods such as AutoDAN-Zhu in such an existing setting, ii) via novel use of energy functions, COLD-Attack is capable of generating paraphrase attacks with or without sentiment control, and iii) COLD-Attack can generate diverse adversarial prompts satisfying the position constraint under various sentiment/lexical/format/style requirements (on outputs). In all the settings, the attacks generated from our method not only exhibit fluency but also adhere to the pre-defined user requirements, supporting our claim that COLD-Attack offers a more versatile and controllable attack strategy. As a preview, Figure 1 provides a few selected samples obtained from our energy-based method to showcase the power of COLD-Attack in all three settings (more examples can be found in Appendix D). We view COLD-Attack as a complement rather than a replacement of existing methods (e.g. GCG, AutoDAN, etc.). We hope that our perspective on controllable attacks can inspire more works along this direction. Section 5. Experimental Evaluations |
| Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign, 2 University of California, San Diego, 3 Allen Institute for AI. |
| Pseudocode | Yes | A pseudo-code for COLD-Attack is given in Algorithm 1. Algorithm 1 COLD-Attack |
| Open Source Code | Yes | Our code is available at https://github.com/Yu-Fangxu/COLD-Attack. |
| Open Datasets | Yes | For efficient evaluation, we use a subset of AdvBench introduced in (Zou et al., 2023) to assess COLD-Attack. This dataset comprises 50 instructions designed to solicit harmful content. These instructions are selected from the original dataset to cover a wide range of harmful topics while minimizing duplicates. |
| Dataset Splits | No | The paper describes the AdvBench dataset as the set of instructions used to assess COLD-Attack, functioning as a test set for their attack methodology. It does not describe any train/validation/test splits of this dataset for the purpose of training a model for their own experimental setup. Their work is on generating attacks against existing LLMs using this dataset for evaluation, not on training a new model where such splits would typically be specified. |
| Hardware Specification | Yes | COLD-Attack is on average 10× faster than GCG and GCG-reg: executing COLD-Attack for a single request using a single NVIDIA V100 GPU takes about 20 minutes (with 2000 steps and a batch of 8 samples), while GCG and GCG-reg require approximately 3.23 hours for the same task (with 500 steps and a batch size of 512). We report the detailed running time in Table 4 in the appendix. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We run GCG with a batch size of 512 and a top-k of 256 to generate a single output. The number of suffix tokens is 20 and we run the optimization for 500 steps. We run COLD-Attack for 2000 iterations with step size η = 0.1. In addition, we use a decreasing noise schedule with σ = {1, 0.5, 0.1, 0.05, 0.01} at iterations n = {0, 50, 200, 500, 1500}, respectively. The hyper-parameters used in different settings are listed in Table 11. |
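Read together, the Pseudocode and Experiment Setup rows describe an energy-based Langevin-dynamics loop over soft token logits with an annealed noise schedule. Below is a minimal sketch of that update rule using the quoted hyper-parameters (2000 iterations, step size η = 0.1, σ decreasing from 1 to 0.01 at the listed iterations). The `energy` function, the vocabulary size, and the final discrete decoding step are placeholders and assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
import torch

# Hyper-parameters quoted from the Experiment Setup row above.
NUM_ITERS = 2000          # total Langevin iterations
STEP_SIZE = 0.1           # step size eta
NOISE_SCHEDULE = {0: 1.0, 50: 0.5, 200: 0.1, 500: 0.05, 1500: 0.01}  # sigma active from iteration n
SUFFIX_LEN = 20           # number of suffix tokens (as in the GCG comparison)
VOCAB_SIZE = 32000        # assumed vocabulary size (e.g. Llama-2); placeholder

def current_sigma(step: int) -> float:
    """Return the noise level active at a given iteration (step-wise decreasing schedule)."""
    sigma = None
    for boundary, value in sorted(NOISE_SCHEDULE.items()):
        if step >= boundary:
            sigma = value
    return sigma

def energy(logits: torch.Tensor) -> torch.Tensor:
    """Placeholder energy function.

    In the paper this combines attack, fluency, and controllability terms;
    here it is only a stand-in so the update rule below is runnable.
    """
    return logits.pow(2).sum()

# Continuous relaxation: optimize soft token logits rather than discrete tokens.
y = torch.randn(SUFFIX_LEN, VOCAB_SIZE, requires_grad=True)

for n in range(NUM_ITERS):
    e = energy(y)
    (grad,) = torch.autograd.grad(e, y)
    sigma = current_sigma(n)
    with torch.no_grad():
        # Langevin update: gradient descent on the energy plus annealed Gaussian noise.
        y -= STEP_SIZE * grad
        y += sigma * torch.randn_like(y)

# After sampling, the soft logits would be decoded back to discrete tokens;
# the paper's LLM-guided decoding step is not reproduced in this sketch.
```

The step-wise σ schedule mirrors simulated-annealing style sampling: large noise early for exploration, small noise late so the iterate settles near a low-energy (fluent, constraint-satisfying) suffix.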