Fast Adversarial Attacks on Language Models In One GPU Minute
Authors: Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we demonstrate various applications of BEAST such as fast jailbreaking, inducing hallucinations, and improving membership inference attacks. Figure 1 shows an overview of our work. In summary, we make the following contributions in our work: In Section 3, we introduce a novel class of fast beam search-based algorithms, BEAST, for attacking LMs that can run in one GPU minute. Our attack offers tunable parameters that allow a tradeoff between attack speed, success rate, and adversarial prompt readability. While the existing jailbreaking methods have their own advantages, in Section 4 we demonstrate that BEAST can perform targeted adversarial attacks to jailbreak a wide range of aligned LMs using just one Nvidia RTX A6000 with 48GB in one minute. We find that BEAST is the state-of-the-art jailbreak attack in this constrained setting. For instance, in just one minute per prompt, we get an attack success rate of 89% on jailbreaking Vicuna-7B-v1.5, while the best baseline method achieves 58%. BEAST can also generate adversarial suffixes that can transfer to unseen prompts (Section 4.5) and unseen models (Section 4.6). For example, we show that BEAST suffixes optimized using Vicuna models (7B and 13B) can transfer to Mistral-7B, GPT-3.5-Turbo, and GPT-4-Turbo with ASR of 8%, 40%, and 12%, respectively. Our experiments in Section 5 show that BEAST can be used to perform untargeted adversarial attacks on aligned LMs to elicit hallucinations in them. We perform human studies to measure hallucinations and find that our attacks make LMs generate 15% more incorrect outputs. We also find that the attacked LMs output irrelevant content 22% of the time. |
| Researcher Affiliation | Academia | Vinu Sankar Sadasivan¹, Shoumik Saha*¹, Gaurang Sriramanan*¹, Priyatham Kattakinda², Atoosa Chegini¹, Soheil Feizi¹. *Equal contribution. ¹Department of Computer Science, ²Department of Electrical & Computer Engineering. Correspondence to: Vinu Sankar Sadasivan <vinu@cs.umd.edu>. |
| Pseudocode | Yes | Algorithm 1 BEAST |
| Open Source Code | Yes | Our codebase is released on https://github.com/vinusankars/BEAST |
| Open Datasets | Yes | We use the AdvBench Harmful Behaviors dataset introduced in Zou et al. (2023). ... We use the TruthfulQA dataset introduced in Lin et al. (2021). ... We use the WikiMIA dataset introduced in Shi et al. (2023). |
| Dataset Splits | Yes | We use the AdvBench Harmful Behaviors dataset introduced in Zou et al. (2023). ... which we partition as follows: we consider the first twenty user inputs as the train partition, craft two adversarial suffixes by utilizing ten inputs at a time, and consider the held-out user inputs 21-100 as the test partition. |
| Hardware Specification | Yes | For our jailbreak attacks, we use a single Nvidia RTX A6000 GPU (48GB). Although our attacks can run efficiently on one Nvidia RTX A5000 GPU (24GB), we use the 48GB card to accommodate the baselines and perform a fair evaluation. |
| Software Dependencies | No | The paper mentions using models from Hugging Face and an implementation in Python, but it does not specify exact version numbers for Python, PyTorch, or other relevant libraries and frameworks needed to reproduce the software environment. |
| Experiment Setup | Yes | For our attacks, we set the LMs to have a temperature value of 1, and we set k1 = k2 for simplicity. ... We find the attack to run for L = 40 steps to optimize both ASR and the attack speed. ... For 7B parameter models, we use k = 15 for all settings for the best ASR results. For the 13B model (Vicuna-13B), we use k = 9 and k = 15, respectively, for attacks with budgets of one GPU minute and two GPU minutes. ... For our untargeted attack, we use k = 9 and L = 25. ... For our privacy attacks, we use k = 5 and L = 25. (A hedged sketch combining the algorithm and these settings appears below the table.) |
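The Research Type and Experiment Setup rows above describe BEAST's core procedure, a beam search over adversarial suffix tokens, together with its working hyperparameters (temperature 1, k1 = k2 = k, L = 40 steps for the 7B jailbreak setting). The sketch below is a minimal, hedged reconstruction of that loop for illustration only: the victim model name, the target prefix "Sure, here is", and the `target_loss` helper are assumptions introduced here, not taken from the paper; the released codebase at https://github.com/vinusankars/BEAST remains the reference implementation.

```python
# Hedged sketch of a BEAST-style beam-search jailbreak attack.
# Assumptions (not from the paper's codebase): model name, target string,
# and the target_loss helper are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "lmsys/vicuna-7b-v1.5"  # assumed victim model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to(device)
model.eval()


def target_loss(ids, target_ids):
    """Negative log-likelihood of the target continuation given prompt + suffix.
    Lower is better for the attacker in the targeted (jailbreak) setting."""
    full = torch.cat([ids, target_ids], dim=-1).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(full).logits
    # Logits at position i predict token i+1, so score only the target positions.
    tgt_logits = logits[0, ids.shape[-1] - 1 : -1].float()
    logp = torch.log_softmax(tgt_logits, dim=-1)
    return -logp.gather(1, target_ids.unsqueeze(1).to(device)).sum().item()


def beast_attack(prompt, target="Sure, here is", k=15, L=40):
    """Grow an adversarial suffix token by token with beam search.
    k: beam width and number of sampled candidate tokens per step (k1 = k2 = k).
    L: number of suffix tokens appended (number of attack steps)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
    beams = [(prompt_ids, 0.0)]
    for _ in range(L):
        candidates = []
        for ids, _ in beams:
            with torch.no_grad():
                logits = model(ids.unsqueeze(0).to(device)).logits[0, -1]
            probs = torch.softmax(logits.float(), dim=-1)  # temperature 1 sampling
            next_toks = torch.multinomial(probs, k)        # sample k candidate tokens
            for t in next_toks:
                new_ids = torch.cat([ids, t.view(1).cpu()])
                candidates.append((new_ids, target_loss(new_ids, target_ids)))
        # Keep the k lowest-loss candidates as the next beam.
        # (The released implementation batches these evaluations for speed.)
        beams = sorted(candidates, key=lambda c: c[1])[:k]
    best_ids, best_loss = beams[0]
    return tok.decode(best_ids[prompt_ids.shape[-1]:]), best_loss
```

A call such as `beast_attack(user_prompt, k=15, L=40)` mirrors the reported 7B-model jailbreak configuration; actual wall-clock cost depends on hardware and on how candidate evaluations are batched, so this unbatched sketch will not by itself match the one-GPU-minute budget.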