Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Authors: Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To systematically evaluate our findings, we evaluate our generation exploitation attack on 11 open-source LLMs spanning four different model families (Section 4.2), including LLAMA2 (Touvron et al., 2023b), VICUNA (Chiang et al., 2023), FALCON (Almazrouei et al., 2023), and MPT models (MosaicML, 2023). Our experimental results show that our generation exploitation attack can increase the attack success rate to > 95% for 9 out of 11 models.
Researcher Affiliation | Academia | Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen; Computer Science Department & Princeton Language and Intelligence, Princeton University; yangsibo@princeton.edu, {samyakg,mengzhou,li,danqic}@cs.princeton.edu
Pseudocode | Yes | Algorithm 1: The generation exploitation attack (a hedged Python sketch of this procedure appears below the table).
Open Source Code | Yes | Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.
Open Datasets | Yes | To systematically evaluate the effectiveness of our attack, we primarily use two benchmarks: AdvBench (Zou et al., 2023)... We use the HH-RLHF dataset (Bai et al., 2022a) to train the classifier. (A hedged training sketch for this classifier appears below the table.)
Dataset Splits | Yes | The trained classifier achieves 96% accuracy on a validation set.
Hardware Specification | Yes | launching our attack with a single prompt on LLAMA2-7B-CHAT using a single NVIDIA A100 GPU takes about 3 minutes
Software Dependencies | No | The paper mentions software tools and models such as the BERT-BASE-CASED model, the TOXIC-BERT model, and the AlpacaFarm framework, but does not specify their version numbers.
Experiment Setup | Yes | Regarding the system prompt, we consider either 1) prepending it before the user instruction, or 2) not including it. In terms of decoding strategies, we experiment with the following three variants: Temperature sampling with varied temperatures τ. Temperature controls the sharpness of the next-token distribution (see Equation (1)), and we vary it from 0.05 to 1 with step size 0.05, which gives us 20 configurations. Top-K sampling filters the K most likely next words... We vary K in {1, 2, 5, 10, 20, 50, 100, 200, 500}, which gives us 9 configurations. Top-p sampling... We vary p from 0.05 to 1 with step size 0.05, which gives us 20 configurations... The fine-tuning process uses a learning rate of 2 × 10^-5 (with a cosine learning rate scheduler and a warm-up ratio of 0.3), a batch size of 16, and runs for a total of 3 epochs. (Hedged sketches of the decoding-configuration sweep and the fine-tuning setup appear below the table.)
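
The Pseudocode and Experiment Setup rows describe the attack concretely enough to sketch it. The following is a minimal, hedged Python sketch of Algorithm 1 using HuggingFace transformers: it enumerates the 49 decoding configurations quoted above (20 temperatures, 9 top-K values, 20 top-p values), generates a response under each, and keeps the one rated most harmful by an external scorer. The model id, max_new_tokens, and the score_harmfulness helper are illustrative assumptions, not the authors' released implementation (see their repository for that).

```python
# Minimal sketch of Algorithm 1 (the generation exploitation attack):
# sweep decoding configurations and keep the response rated most harmful
# by an external scorer. The model id, max_new_tokens, and the
# score_harmfulness helper are illustrative stand-ins, not the authors'
# released code.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 20 temperatures + 9 top-K values + 20 top-p values = 49 configurations,
# as listed in the Experiment Setup row. top_k=0 disables the default
# top-k filter so only the swept parameter varies.
configs = (
    [{"do_sample": True, "temperature": float(t), "top_k": 0}
     for t in np.arange(0.05, 1.0001, 0.05)]
    + [{"do_sample": True, "top_k": int(k)}
       for k in (1, 2, 5, 10, 20, 50, 100, 200, 500)]
    + [{"do_sample": True, "top_p": float(p), "top_k": 0}
       for p in np.arange(0.05, 1.0001, 0.05)]
)

def exploit_generation(prompt: str, score_harmfulness) -> str:
    """Return the response with the highest harmfulness score over all configs."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    best_score, best_response = float("-inf"), ""
    for cfg in configs:
        output = model.generate(**inputs, max_new_tokens=256, **cfg)
        response = tokenizer.decode(
            output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        score = score_harmfulness(prompt, response)  # hypothetical scorer
        if score > best_score:
            best_score, best_response = score, response
    return best_response
```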
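
The Open Datasets and Dataset Splits rows state that the scorer is a classifier trained on HH-RLHF that reaches 96% validation accuracy. The sketch below shows one plausible way to train such a score_harmfulness backbone with bert-base-cased; the labeling scheme (chosen = harmless, rejected = harmful), the dataset id Anthropic/hh-rlhf, and the training hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of training a response classifier on HH-RLHF with
# bert-base-cased. The labeling scheme (chosen -> harmless, rejected ->
# harmful), the dataset id, and all hyperparameters are assumptions for
# illustration, not the paper's exact recipe.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2
)

raw = load_dataset("Anthropic/hh-rlhf", split="train")

def to_examples(batch):
    # Label preferred ("chosen") responses 0 and dispreferred ("rejected")
    # responses 1; each input row yields two training examples.
    texts = list(batch["chosen"]) + list(batch["rejected"])
    labels = [0] * len(batch["chosen"]) + [1] * len(batch["rejected"])
    enc = tokenizer(texts, truncation=True, max_length=512)
    enc["labels"] = labels
    return enc

train_ds = raw.map(to_examples, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="harm-classifier",  # placeholder path
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables padding via the default data collator
)
trainer.train()
```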
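
Finally, the fine-tuning hyperparameters quoted in the Experiment Setup row map directly onto HuggingFace TrainingArguments. The snippet below is only a transcription of those numbers; the output directory is a placeholder and the rest of the training script is omitted.

```python
# Transcription of the reported fine-tuning hyperparameters into
# HuggingFace TrainingArguments; output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetuned-model",       # placeholder path
    learning_rate=2e-5,                 # learning rate of 2 x 10^-5
    lr_scheduler_type="cosine",         # cosine learning rate scheduler
    warmup_ratio=0.3,                   # warm-up ratio of 0.3
    per_device_train_batch_size=16,     # batch size of 16
    num_train_epochs=3,                 # 3 epochs
)
```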