Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Authors: Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To systematically evaluate our findings, we evaluate our generation exploitation attack on 11 open-source LLMs spanning four different model families (Section 4.2), including LLAMA2 (Touvron et al., 2023b), VICUNA (Chiang et al., 2023), FALCON (Almazrouei et al., 2023), and MPT models (Mosaic ML, 2023). Our experimental results show that our generation exploitation attack can increase the attack success rate to > 95% for 9 out of 11 models. |
| Researcher Affiliation | Academia | Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen Computer Science Department & Princeton Language and Intelligence, Princeton University yangsibo@princeton.edu {samyakg,mengzhou,li,danqic}@cs.princeton.edu |
| Pseudocode | Yes | Algorithm 1: The generation exploitation attack |
| Open Source Code | Yes | Our code is available at https://github.com/Princeton-Sys ML/Jailbreak LLM. |
| Open Datasets | Yes | To systematically evaluate the effectiveness of our attack, we primarily use two benchmarks: Adv Bench (Zou et al., 2023)... We use the HH-RLHF dataset (Bai et al., 2022a) to train the classifier. |
| Dataset Splits | Yes | The trained classifier achieves 96% accuracy on a validation set. |
| Hardware Specification | Yes | launching our attack with a single prompt on LLAMA2-7B-CHAT using a single NVIDIA A100 GPU takes about 3 minutes |
| Software Dependencies | No | The paper mentions software tools and models like 'BERT-BASE-CASED model', 'TOXIC-BERT model', and 'Alpaca Farm framework' but does not specify their version numbers. |
| Experiment Setup | Yes | Regarding the system prompt, we consider either 1) prepending it before the user instruction, or 2) not including it. In terms of decoding strategies, we experiment with the following three variants: Temperature sampling with varied temperatures τ. Temperature controls the sharpness of the next-token distribution (see Equation (1)), and we vary it from 0.05 to 1 with step size 0.05, which gives us 20 configurations. Top-K sampling filters the K most likely next words... We vary K in {1, 2, 5, 10, 20, 50, 100, 200, 500}, which gives us 9 configurations. Top-p sampling... We vary p from 0.05 to 1 with step size 0.05, which gives us 20 configurations... The fine-tuning process uses a learning rate of 2 10 5 (with cosine learning rate scheduler and a warm-up ratio of 0.3), a batch size of 16, and runs for a total of 3 epochs. |