Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Authors: Xiaosen Zheng @ SMU, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "For example, our method achieves >80% (mostly >95%) ASRs on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, which is challenging for suffix-based jailbreaking. In addition, we conduct comprehensive and elaborate (e.g., making sure to use correct system prompts) evaluations against other aligned LLMs and advanced defenses, where our method consistently achieves nearly 100% ASRs."
Researcher Affiliation | Collaboration | Xiaosen Zheng 1,2, Tianyu Pang 1, Chao Du 1, Qian Liu 1, Jing Jiang 2, Min Lin 1; 1 Sea AI Lab, Singapore; 2 Singapore Management University; {zhengxs, tianyupang, duchao, liuqian, linmin}@sea.com; jingjiang@smu.edu.sg
Pseudocode | Yes | Algorithm 1: Batch demo-level random search; Algorithm 2: Batch demo-level random search for SmoothLLM (a hedged sketch of the search loop appears after the table)
Open Source Code | Yes | "Our code is available at https://github.com/sail-sg/I-FSJ."
Open Datasets | Yes | "For the demonstrations (harmful pairs) used in few-shot jailbreaking, we use Mistral-7B-Instruct-v0.2, an LLM with weaker safety alignment, to craft the harmful content on a set of harmful requests. ... Finally, we create a demonstration pool as D = {(x1, y1), ..., (x520, y520)}." (a pool-construction and leakage-filter sketch appears after the table)
Dataset Splits | Yes | "Our targets are a collection of 50 harmful behaviors from AdvBench curated by Chao et al. [9] that ensures distinct and diverse harmful requests. We exclude the demonstrations for the same target harmful behavior from the pool to avoid leakage. ... Figure 4: Ablation study of the effect of pool size and number of shots to I-FSJ on Llama-2-7B-Chat."
Hardware Specification | Yes | "Every experiment is run on a single NVIDIA A100 (40GB) GPU within a couple of hours."
Software Dependencies | No | The paper mentions various LLMs and tools used (e.g., Mistral-7B-Instruct-v0.2, GPT-2, Sentence-BERT, Hugging Face Transformers) but does not provide specific version numbers for these software components, which limits exact reproducibility.
Experiment Setup | Yes | "For the demo-level random search, we set batch size B = 8 and iterations T = 128. We let the target LLMs generate up to 100 new tokens. We use each LLM's default generation config." (a generation-setup sketch also appears after the table)
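
As a reading aid for the Pseudocode and Experiment Setup rows, here is a minimal sketch of batch demo-level random search under stated assumptions: pool is the demonstration pool, score_fn stands in for the target LLM's loss on the forced affirmative response, and all names are hypothetical rather than taken from the authors' repository. It uses the paper's reported B = 8 and T = 128.

```python
import random

def demo_level_random_search(pool, num_shots, score_fn, B=8, T=128):
    """Hedged sketch of batch demo-level random search (cf. Algorithm 1).

    pool      -- list of (request, response) demonstration pairs
    num_shots -- number of in-context demonstrations in the prompt
    score_fn  -- maps a demo list to a loss; lower means the target LLM
                 is closer to producing the desired affirmative output
    B, T      -- candidate batch size and iteration count (paper: 8, 128)
    """
    demos = random.sample(pool, num_shots)  # random initialization
    best = score_fn(demos)
    for _ in range(T):
        # Propose B candidates, each swapping one random demo slot
        # for a random demonstration drawn from the pool.
        cands = []
        for _ in range(B):
            cand = list(demos)
            cand[random.randrange(num_shots)] = random.choice(pool)
            cands.append(cand)
        score, cand = min(((score_fn(c), c) for c in cands),
                          key=lambda t: t[0])
        if score < best:  # greedy accept: keep the swap only if it helps
            best, demos = score, cand
    return demos, best
```

Algorithm 2 (the SmoothLLM variant) presumably follows the same loop but scores candidates against randomly perturbed copies of the prompt, mirroring the SmoothLLM defense; that averaging step is omitted here.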
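For the Open Datasets and Dataset Splits rows: the pool of 520 harmful request/response pairs is drafted by a weakly aligned model, and demonstrations matching the current target behavior are excluded to avoid leakage. A minimal sketch, assuming the Hugging Face transformers text-generation pipeline and an exact-match leakage filter (both illustrative assumptions, not the authors' exact pipeline):

```python
from transformers import pipeline

# Assumption: a weakly aligned model drafts the harmful responses that
# seed the demonstration pool D = {(x_1, y_1), ..., (x_520, y_520)}.
drafter = pipeline("text-generation",
                   model="mistralai/Mistral-7B-Instruct-v0.2")

def build_pool(harmful_requests, max_new_tokens=256):
    """Pair each request x with a drafted response y."""
    pool = []
    for x in harmful_requests:
        y = drafter(x, max_new_tokens=max_new_tokens,
                    return_full_text=False)[0]["generated_text"]
        pool.append((x, y))
    return pool

def exclude_target(pool, target_request):
    """Drop demonstrations for the target behavior to avoid leakage.
    Exact-match filtering is an assumption; the paper's criterion
    may be broader than string equality."""
    return [(x, y) for (x, y) in pool if x != target_request]
```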
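And for the Experiment Setup row, a sketch of the evaluation-time generation settings: up to 100 new tokens with the model's otherwise-default generation config. The checkpoint name and the prompt placeholder are assumptions, and the few-shot prompt construction is elided.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # assumed public checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto")

prompt = "<few-shot jailbreak prompt, with the correct system prompt>"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Up to 100 new tokens; all other settings stay at the model's
# default generation config, as the paper states.
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:],
                 skip_special_tokens=True))
```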