COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Authors: Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack. In our numerical study, we observe: i) COLD-Attack can efficiently generate fluent (suffix) attacks with the continuation constraint and outperform existing methods such as AutoDAN-Zhu in such an existing setting, ii) via novel use of energy functions, COLD-Attack is capable of generating paraphrase attacks with or without sentiment control, and iii) COLD-Attack can generate diverse adversarial prompts satisfying the position constraint under various sentiment/lexical/format/style requirements (on outputs). In all the settings, the attacks generated from our method not only exhibit fluency but also adhere to the pre-defined user requirements, supporting our claim that COLD-Attack offers a more versatile and controllable attack strategy. As a preview, Figure 1 provides a few selected samples obtained from our energy-based method to showcase the power of COLD-Attack in all three settings (more examples can be found in Appendix D). We view COLD-Attack as a complement rather than a replacement of existing methods (e.g., GCG, AutoDAN, etc.). We hope that our perspective on controllable attacks can inspire more works along this direction. (Section 5: Experimental Evaluations)
Researcher Affiliation | Collaboration | 1. University of Illinois Urbana-Champaign; 2. University of California, San Diego; 3. Allen Institute for AI.
Pseudocode | Yes | A pseudo-code for COLD-Attack is given in Algorithm 1 (Algorithm 1: COLD-Attack).
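Algorithm 1 itself is not reproduced in this report, but the paper frames COLD-Attack as energy-based constrained decoding via Langevin dynamics (following the COLD framework), which suggests the shape of the loop. Below is a minimal, hypothetical sketch, not the repository's actual implementation; `cold_attack_sketch`, `energy_fn`, `noise_schedule`, and all default values are illustrative stand-ins.

```python
import torch

def cold_attack_sketch(energy_fn, noise_schedule, vocab_size=32000,
                       seq_len=20, n_iters=2000, eta=0.1):
    """Hypothetical sketch of the Langevin-dynamics loop behind Algorithm 1.

    energy_fn:      maps (seq_len, vocab_size) soft logits -> scalar energy
                    (combining fluency, attack, and control terms).
    noise_schedule: maps iteration index -> Gaussian noise scale sigma_n.
    """
    # COLD-style methods optimize a continuous "soft" prompt rather than
    # discrete tokens, so gradients of the energy are well defined.
    y = torch.randn(seq_len, vocab_size, requires_grad=True)
    for n in range(n_iters):
        grad, = torch.autograd.grad(energy_fn(y), y)
        with torch.no_grad():
            # Langevin update: y <- y - eta * grad(E(y)) + sigma_n * noise
            y -= eta * grad
            y += noise_schedule(n) * torch.randn_like(y)
    # The paper decodes the soft sequence back to discrete tokens with an
    # LM-guided step; a plain argmax is only a stand-in here.
    return y.argmax(dim=-1)
```

For instance, `cold_attack_sketch(lambda y: y.pow(2).sum(), lambda n: 0.1)` exercises the loop against a dummy quadratic energy.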
Open Source Code | Yes | Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.
Open Datasets | Yes | For efficient evaluation, we use a subset of AdvBench introduced in (Zou et al., 2023) to assess COLD-Attack. This dataset comprises 50 instructions designed to solicit harmful content. These instructions are selected from the original dataset to cover a wide range of harmful topics while minimizing duplicates.
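The AdvBench release itself is public (the llm-attacks repository distributes it as harmful_behaviors.csv with goal/target columns). A hedged loading sketch follows; the local path is hypothetical, and the paper's 50-instruction subset is a curated selection, not simply the first 50 rows.

```python
import csv

# Hypothetical local path to the original AdvBench CSV, whose rows pair a
# harmful "goal" (instruction) with a "target" (desired model reply).
ADVBENCH_CSV = "data/advbench/harmful_behaviors.csv"

with open(ADVBENCH_CSV, newline="") as f:
    goals = [row["goal"] for row in csv.DictReader(f)]

# Placeholder for the paper's curated 50-instruction subset; the actual
# selection minimizes duplicates and covers diverse harmful topics.
subset = goals[:50]
```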
Dataset Splits | No | The paper uses the AdvBench subset purely as an evaluation (test) set for its attack methodology and does not describe any train/validation/test splits. The work generates attacks against existing LLMs rather than training a new model, so such splits would not typically be specified.
Hardware Specification | Yes | COLD-Attack is on average 10× faster than GCG and GCG-reg: executing COLD-Attack for a single request using a single NVIDIA V100 GPU takes about 20 minutes (with 2000 steps and a batch of 8 samples), while GCG and GCG-reg require approximately 3.23 hours for the same task (with 500 steps and a batch size of 512). We report the detailed running time in Table 4 in the appendix.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We run GCG with a batch size of 512 and a top-k of 256 to generate a single output. The number of suffix tokens is 20 and we run the optimization for 500 steps. We run COLD-Attack for 2000 iterations with step size η = 0.1. In addition, we used a decreasing noise schedule σ = {1, 0.5, 0.1, 0.05, 0.01} applied at iterations n = {0, 50, 200, 500, 1500}, respectively. The hyper-parameters used in different settings are listed in Table 11.
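Since the quoted schedule is specified exactly, it can be written down directly. A small helper reproducing the piecewise-constant schedule (the function name and structure are ours, not the repository's):

```python
# Piecewise-constant noise schedule from the quoted setup: sigma decreases
# at iterations 0, 50, 200, 500, and 1500.
_SCHEDULE = [(0, 1.0), (50, 0.5), (200, 0.1), (500, 0.05), (1500, 0.01)]

def noise_schedule(n: int) -> float:
    """Return the Gaussian noise scale sigma_n used at iteration n."""
    sigma = _SCHEDULE[0][1]
    for start, value in _SCHEDULE:
        if n >= start:
            sigma = value
    return sigma

# Sanity checks against the quoted breakpoints.
assert noise_schedule(0) == 1.0
assert noise_schedule(199) == 0.5
assert noise_schedule(1999) == 0.01
```

This helper slots directly into the `noise_schedule` argument of the sketch given under the Pseudocode row above.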