Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench. |
| Researcher Affiliation | Collaboration | Harbin Institute of Technology; Tencent Security Big Data Lab; Independent Researcher; UC Davis |
| Pseudocode | No | The paper describes the methods and adaptations in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks |
| Open Datasets | Yes | We use AdvBench [58]. |
| Dataset Splits | Yes | For query-specific adversarial suffix generation, following [58], we use the first 100 harmful behaviors in AdvBench. For universal adversarial suffix generation, we use the first 10 harmful queries in AdvBench to generate a universal adversarial suffix and test it on the remaining 510 harmful queries. (See the slicing sketch below the table.) |
| Hardware Specification | Yes | Time cost is derived by generating a single adversarial suffix on a single NVIDIA V100 32GB GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow with their corresponding versions). |
| Experiment Setup | Yes | For SGM, we simply set γ = 0.5. For the selection of intermediate representation to perform LILA, we choose the first token from the target phrase... For the intermediate layer, we choose the midpoint of the model layers... We perform 500 iterations for all methods, with the number of adversarial tokens set to 20. (See the configuration sketch below the table.) |
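
The dataset slicing described in the Dataset Splits row is straightforward to reproduce. Below is a minimal Python sketch, assuming AdvBench's public harmful_behaviors.csv with its "goal" and "target" columns; the file name and column names come from the AdvBench release, not from this paper:

```python
# Minimal sketch of the AdvBench splits reported in the paper.
# Assumes the public AdvBench file harmful_behaviors.csv with
# "goal" and "target" columns (520 harmful behaviors in total).
import csv

with open("harmful_behaviors.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Query-specific adversarial suffix generation: the first 100 behaviors.
query_specific = rows[:100]

# Universal adversarial suffix generation: optimize on the first 10
# queries, then evaluate the resulting suffix on the remaining 510.
universal_train = rows[:10]
universal_test = rows[10:]

assert len(universal_test) == 510
```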
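
The Experiment Setup row fixes the main hyperparameters. The sketch below simply collects them in one place; the AttackConfig class and its field names are illustrative rather than taken from the authors' released code, and the midpoint-layer computation assumes the 32 transformer layers of Llama-2-7B-Chat:

```python
# Hypothetical configuration object gathering the reported hyperparameters;
# names are illustrative, not from the authors' released code.
from dataclasses import dataclass

@dataclass
class AttackConfig:
    sgm_gamma: float = 0.5     # SGM factor γ, set to 0.5 in the paper
    num_steps: int = 500       # optimization iterations for all methods
    num_adv_tokens: int = 20   # length of the adversarial suffix
    num_layers: int = 32       # assumption: Llama-2-7B-Chat has 32 layers

    # LILA uses the intermediate representation of the first token of the
    # target phrase, taken at the midpoint layer of the model.
    @property
    def lila_layer(self) -> int:
        return self.num_layers // 2

cfg = AttackConfig()
print(cfg.lila_layer)  # -> 16 for a 32-layer model
```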