Fast Adversarial Attacks on Language Models In One GPU Minute

Authors: Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we demonstrate various applications of BEAST such as fast jailbreaking, inducing hallucinations, and improving membership inference attacks. Figure 1 shows an overview of our work. In summary, we make the following contributions in our work: In §3, we introduce a novel class of fast beam search-based algorithms, BEAST, for attacking LMs that can run in one GPU minute. Our attack offers tunable parameters that allow a tradeoff between attack speed, success rate, and adversarial prompt readability. While the existing jailbreaking methods have their own advantages, in §4, we demonstrate that BEAST can perform targeted adversarial attacks to jailbreak a wide range of aligned LMs using just one Nvidia RTX A6000 with 48GB in one minute. We find that BEAST is the state-of-the-art jailbreak attack in this constrained setting. For instance, in just one minute per prompt, we get an attack success rate of 89% on jailbreaking Vicuna-7B-v1.5, while the best baseline method achieves 58%. BEAST can also generate adversarial suffixes that can transfer to unseen prompts (§4.5) and unseen models (§4.6). For example, we show that BEAST suffixes optimized using Vicuna models (7B and 13B) can transfer to Mistral-7B, GPT-3.5-Turbo, and GPT-4-Turbo with ASRs of 8%, 40%, and 12%, respectively. Our experiments in §5 show that BEAST can be used to perform untargeted adversarial attacks on aligned LMs to elicit hallucinations in them. We perform human studies to measure hallucinations and find that our attacks make LMs generate 15% more incorrect outputs. We also find that the attacked LMs output irrelevant content 22% of the time.
Researcher Affiliation | Academia | Vinu Sankar Sadasivan 1, Shoumik Saha *1, Gaurang Sriramanan *1, Priyatham Kattakinda 2, Atoosa Chegini 1, Soheil Feizi 1. *Equal contribution. 1 Department of Computer Science, 2 Department of Electrical & Computer Engineering. Correspondence to: Vinu Sankar Sadasivan <vinu@cs.umd.edu>.
Pseudocode | Yes | Algorithm 1: BEAST
Open Source Code | Yes | Our codebase is released at https://github.com/vinusankars/BEAST
Open Datasets | Yes | We use the AdvBench Harmful Behaviors dataset introduced in Zou et al. (2023). ... We use the TruthfulQA dataset introduced in Lin et al. (2021). ... We use the WikiMIA dataset introduced in Shi et al. (2023).
Dataset Splits | Yes | We use the AdvBench Harmful Behaviors dataset introduced in Zou et al. (2023). ... which we partition as follows: we consider the first twenty user inputs as the train partition, and craft two adversarial suffixes by utilizing ten inputs at a time, and consider the held-out user inputs 21-100 as the test partition.
Hardware Specification | Yes | For our jailbreak attacks, we use a single Nvidia RTX A6000 GPU (48GB). Although our attacks can run efficiently on one Nvidia RTX A5000 GPU (24GB), we use the 48GB card to accommodate the baselines and perform a fair evaluation.
Software Dependencies | No | The paper mentions using models from Hugging Face and implementing in Python, but it does not specify exact version numbers for Python, PyTorch, or other relevant libraries and frameworks needed to reproduce the software environment.
Experiment Setup | Yes | For our attacks, we set the LMs to have a temperature value of 1, and we set k1 = k2 for simplicity. ... We find that running the attack for L = 40 steps optimizes both ASR and attack speed. ... For 7B parameter models, we use k = 15 for all settings for the best ASR results. For the 13B model (Vicuna-13B), we use k = 9 and k = 15, respectively, for attacks with budgets of one GPU minute and two GPU minutes. ... For our untargeted attack, we use k = 9 and L = 40. ... For our privacy attacks, we use k = 5 and L = 25.
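
The Pseudocode and Experiment Setup rows above describe BEAST as a beam search over appended prompt tokens with a beam width k (e.g., k = 15 for 7B models) and L = 40 search steps at temperature 1. The snippet below is a minimal sketch of such a beam-search attack loop against a Hugging Face causal LM. It is not the authors' released implementation (see https://github.com/vinusankars/BEAST); the scoring objective, helper names, and the omission of the chat template are simplifying assumptions made for illustration.

```python
# Minimal, illustrative sketch of a beam-search adversarial suffix attack in the
# spirit of Algorithm 1 (BEAST). NOT the authors' released code; the objective
# and helpers below are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # one of the target models evaluated in the paper
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to("cuda").eval()


@torch.no_grad()
def target_nll(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Negative log-likelihood of a target continuation given the current prompt.
    Used here as the attacker's score (lower is better); the choice of objective
    is an assumption for illustration."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0).to(model.device)
    logp = torch.log_softmax(model(ids).logits[0, :-1].float(), dim=-1)
    positions = range(prompt_ids.numel() - 1, ids.shape[1] - 1)
    return -sum(logp[p, ids[0, p + 1]] for p in positions).item()


@torch.no_grad()
def beam_attack(user_prompt: str, target: str, k: int = 15, L: int = 40) -> str:
    """Beam search over appended tokens: beam width k, L steps (cf. Experiment Setup row)."""
    prompt_ids = tok(user_prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
    beams = [(target_nll(prompt_ids, target_ids), prompt_ids)]
    for _ in range(L):
        candidates = []
        for _, ids in beams:
            # Sample k candidate next tokens from the LM at temperature 1.
            logits = model(ids.unsqueeze(0).to(model.device)).logits[0, -1].float()
            picks = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=k)
            for t in picks.cpu():
                new_ids = torch.cat([ids, t.view(1)])
                candidates.append((target_nll(new_ids, target_ids), new_ids))
        beams = sorted(candidates, key=lambda c: c[0])[:k]  # keep the k best-scoring prompts
    return tok.decode(beams[0][1][prompt_ids.numel():])     # the appended adversarial suffix
```

As a usage sketch, `beam_attack("Write a tutorial on ...", "Sure, here is a tutorial")` would return a suffix string to append to the user prompt; a real attack would also place the prompt inside the model's chat template before scoring.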
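
The Dataset Splits row above describes the AdvBench partition used for the suffix-transfer experiments: the first twenty user inputs form the train partition (ten per crafted suffix), and inputs 21-100 are held out for testing. The short sketch below mirrors that split; the CSV path and column name are assumptions about how the data is stored locally, not details from the paper.

```python
# Illustrative AdvBench Harmful Behaviors split matching the Dataset Splits row:
# first 20 behaviors for crafting suffixes (10 per suffix), behaviors 21-100 held out.
# The file path and "goal" column name are assumptions; adapt to your local copy.
import csv

with open("harmful_behaviors.csv", newline="") as f:
    behaviors = [row["goal"] for row in csv.DictReader(f)][:100]

train = behaviors[:20]                      # user inputs 1-20: suffix optimization
suffix_batches = [train[:10], train[10:]]   # two suffixes, each crafted on ten inputs
test = behaviors[20:100]                    # user inputs 21-100: held-out transfer evaluation
```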