Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our empirical results show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench. |
| Researcher Affiliation | Collaboration | Harbin Institute of Technology; Tencent Security Big Data Lab; Independent Researcher; UC Davis |
| Pseudocode | No | The paper describes the methods and adaptations in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at: https://github.com/qizhangli/Gradient-based-Jailbreak-Attacks |
| Open Datasets | Yes | We use AdvBench [58]. |
| Dataset Splits | Yes | For query-specific adversarial suffix generation, following [58], we use the first 100 harmful behaviors in AdvBench. For universal adversarial suffix generation, we use the first 10 harmful queries in AdvBench to generate a universal adversarial suffix and test it on the remaining 510 harmful queries. (See the slicing sketch below the table.) |
| Hardware Specification | Yes | Time cost is derived by generating a single adversarial suffix on a single NVIDIA V100 32GB GPU. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow with their corresponding versions). |
| Experiment Setup | Yes | For SGM, we simply set γ = 0.5. For the selection of intermediate representation to perform LILA, we choose the first token from the target phrase... For the intermediate layer, we choose the midpoint of the model layers... We perform 500 iterations for all methods, with the number of adversarial tokens set to 20. (See the configuration sketch below the table.) |
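
The dataset slicing described in the Dataset Splits row is straightforward to reproduce. Below is a minimal Python sketch, assuming AdvBench's public harmful_behaviors.csv with its "goal" and "target" columns; the file name and column names come from the AdvBench release, not from this paper:

```python
# Minimal sketch of the AdvBench splits reported in the paper.
# Assumes the public AdvBench file harmful_behaviors.csv with
# "goal" and "target" columns (520 harmful behaviors in total).
import csv

with open("harmful_behaviors.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Query-specific adversarial suffix generation: the first 100 behaviors.
query_specific = rows[:100]

# Universal adversarial suffix generation: optimize on the first 10
# queries, then evaluate the resulting suffix on the remaining 510.
universal_train = rows[:10]
universal_test = rows[10:]

assert len(universal_test) == 510
```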
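
The Experiment Setup row fixes the main hyperparameters. The sketch below simply collects them in one place; the AttackConfig class and its field names are illustrative rather than taken from the authors' released code, and the midpoint-layer computation assumes the 32 transformer layers of Llama-2-7B-Chat:

```python
# Hypothetical configuration object gathering the reported hyperparameters;
# names are illustrative, not from the authors' released code.
from dataclasses import dataclass

@dataclass
class AttackConfig:
    sgm_gamma: float = 0.5     # SGM factor γ, set to 0.5 in the paper
    num_steps: int = 500       # optimization iterations for all methods
    num_adv_tokens: int = 20   # length of the adversarial suffix
    num_layers: int = 32       # assumption: Llama-2-7B-Chat has 32 layers

    # LILA uses the intermediate representation of the first token of the
    # target phrase, taken at the midpoint layer of the model.
    @property
    def lila_layer(self) -> int:
        return self.num_layers // 2

cfg = AttackConfig()
print(cfg.lila_layer)  # -> 16 for a 32-layer model
```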