Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Authors: Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on popular open source Target LLMs show highly competitive results on the AdvBench and HarmBench datasets, that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks. |
| Researcher Affiliation | Collaboration | Anselm Paulus *1,2, Arman Zharmagambetov *3, Chuan Guo 3, Brandon Amos 3, Yuandong Tian 3. 1 University of Tübingen; 2 Work done at Meta; 3 FAIR at Meta. Correspondence to: Anselm Paulus <EMAIL>, Arman Zharmagambetov <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 AdvPrompterTrain; Algorithm 2 AdvPrompterOpt: Generate adversarial target by minimizing Equation (5); Algorithm 3 AdvPrompterOpt-greedy; Algorithm 4 Train qθ using PPO. |
| Open Source Code | Yes | Our code is available at http://github.com/facebookresearch/advprompter. |
| Open Datasets | Yes | We utilize the AdvBench (Zou et al., 2023) and HarmBench (Mazeika et al., 2024b) datasets. AdvBench encompasses 520 instructions with harmful behaviors and their corresponding desired positive responses, divided into a 60/20/20 train/validation/test split. HarmBench is curated to significantly reduce the semantic overlap between harmful behaviors, which has been reported as a potential problem of AdvBench in Mazeika et al. (2024b). HarmBench contains 400 unique textual behaviors, and offers a pre-defined validation (80 behaviors) and test (320 behaviors) split, but does not contain a train split. Therefore, when investigating data-transfer attacks we train our method (and find universal adversarial suffixes for other methods) on the validation set, and evaluate on the test set. |
| Dataset Splits | Yes | AdvBench encompasses 520 instructions with harmful behaviors and their corresponding desired positive responses, divided into a 60/20/20 train/validation/test split. HarmBench ... offers a pre-defined validation (80 behaviors) and test (320 behaviors) split, but does not contain a train split. |
| Hardware Specification | Yes | Using the specified hyperparameters, the AdvPrompterTrain process averages 16 hours and 12 minutes for 7B Target LLMs, and 20 hours and 4 minutes for 13B Target LLMs, when run on 2 NVIDIA A100 GPUs for training 10 epochs. |
| Software Dependencies | No | The paper mentions using the TRL package (von Werra et al., 2020) for PPO implementation but does not specify a version number for TRL or any other software dependencies. |
| Experiment Setup | Yes | Unless otherwise specified, we set max_it = 10, replay buffer size R = 256, batch size 8, max_seq_len = 30, regularization strength λ = 100 (150 for Llama2-chat), number of candidates k = 48 and beam size b = 4. After each q-step, we update AdvPrompter 8 times with a learning rate of 5e-4 using LoRA (Hu et al., 2022). We set the rank to 8 and α = 16 for LoRA updates, with other hyperparameters taking default values. For the sampling procedure in Equation (7), we sample from the output logits of AdvPrompter with a temperature parameter of τ = 0.6 and using nucleus sampling with a parameter of top_p = 0.01. |
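The sampling procedure quoted in the Experiment Setup row (temperature τ = 0.6 combined with nucleus sampling at top_p = 0.01) can be sketched as below. This is an illustrative reimplementation, not the paper's code; the function name and the example logits are invented for the sketch. Note that with top_p as low as 0.01, the nucleus usually collapses to the single most likely token, making sampling nearly greedy.

```python
import numpy as np

def sample_next_token(logits, temperature=0.6, top_p=0.01, rng=None):
    """Temperature-scaled nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng(0)
    # Temperature scaling followed by a numerically stable softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p; with top_p = 0.01 this is typically one token.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Hypothetical 4-token vocabulary; token 0 has the highest logit.
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits))  # top_p = 0.01 keeps only the argmax: 0
```

With a larger top_p (e.g. 0.9) the nucleus would contain several tokens and the draw would be genuinely stochastic, which is why the choice of top_p = 0.01 here effectively trades diversity for determinism in suffix generation.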