Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench. |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, 2Tencent Security Big Data Lab, 3Independent Researcher, 4UC Davis |
| Pseudocode | No | The paper describes the methods and adaptations in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks |
| Open Datasets | Yes | We use AdvBench [58]. |
| Dataset Splits | Yes | For query-specific adversarial suffix generation, following [58], we use the first 100 harmful behaviors in AdvBench. For universal adversarial suffix generation, we use the first 10 harmful queries in AdvBench to generate a universal adversarial suffix and test it on the remaining 510 harmful queries in AdvBench. (See the split sketch after the table.) |
| Hardware Specification | Yes | Time cost is derived by generating a single adversarial suffix on a single NVIDIA V100 32GB GPU. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., PyTorch or TensorFlow releases). |
| Experiment Setup | Yes | For SGM, we simply set γ = 0.5. For the selection of intermediate representation to perform LILA, we choose the first token from the target phrase... For the intermediate layer, we choose the midpoint of the model layers... We perform 500 iterations for all methods, with the number of adversarial tokens set to 20. |
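The splits quoted in the Dataset Splits row are simple to reproduce. Below is a minimal sketch, assuming the `harmful_behaviors.csv` file distributed with AdvBench (520 rows with `goal` and `target` columns); the variable names are illustrative, not taken from the paper's code.

```python
import csv

# Load AdvBench harmful behaviors (520 rows, columns "goal" and "target").
with open("harmful_behaviors.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Query-specific suffixes: the first 100 harmful behaviors.
query_specific = rows[:100]

# Universal suffix: optimize on the first 10 queries, test on the rest.
universal_train = rows[:10]
universal_test = rows[10:]  # the remaining 510 queries

print(len(query_specific), len(universal_train), len(universal_test))
```

Running this should print `100 10 510`, matching the counts quoted above.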
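The Experiment Setup row fixes γ = 0.5 for SGM (the Skip Gradient Method, which rescales the gradient flowing through residual branches during backpropagation so that more of the signal travels along skip connections). The following is a minimal PyTorch sketch of that rescaling on a toy residual block; `GAMMA`, `ResidualBlock`, and the block sizes are illustrative stand-ins for the paper's actual model code, and the other reported settings (500 iterations, 20 adversarial tokens, LILA at the midpoint layer) appear only as comments.

```python
import torch
import torch.nn as nn

GAMMA = 0.5          # SGM decay factor, as reported in the setup
NUM_STEPS = 500      # optimization iterations (for context; unused here)
NUM_ADV_TOKENS = 20  # adversarial suffix length (for context; unused here)

class ResidualBlock(nn.Module):
    """Toy stand-in for a transformer decoder layer: h_out = h + f(h)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)

    def forward(self, h):
        branch = self.f(h)
        # The forward value is unchanged, but the gradient flowing back
        # through the residual branch f is scaled by GAMMA, so more of
        # the gradient signal travels along the skip connection.
        branch = GAMMA * branch + (1.0 - GAMMA) * branch.detach()
        return h + branch

# LILA's intermediate layer would be the midpoint of the stack,
# e.g. layer 16 of Llama-2-7B-Chat's 32 decoder layers.
blocks = nn.Sequential(*[ResidualBlock(16) for _ in range(4)])
x = torch.randn(2, 16, requires_grad=True)
blocks(x).sum().backward()
print(x.grad.norm())  # gradient reaches the input, attenuated through f
```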