Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Authors: Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical findings (Section 5). Integrating AdvPrefix into GCG and AutoDAN [Zhu et al., 2023] significantly increases nuanced ASR across Llama-2, 3, 3.1, and Gemma-2. For instance, GCG's ASR on Llama-3 improves from 16% to 70%, highlighting that current safety alignments struggle to generalize to unseen prefixes. |
| Researcher Affiliation | Collaboration | Sicheng Zhu (University of Maryland, College Park); Brandon Amos (FAIR, Meta); Yuandong Tian (FAIR, Meta); Chuan Guo (FAIR, Meta); Ivan Evtimov (FAIR, Meta) |
| Pseudocode | No | The paper describes methodologies using mathematical formulations and descriptive text, but it does not contain explicit pseudocode blocks or algorithm figures labeled as such. |
| Open Source Code | Yes | Code and selected prefixes are released in github.com/facebookresearch/jailbreak-objectives. Warning: This paper includes language that could be considered inappropriate or harmful. |
| Open Datasets | Yes | We curate 50 highly harmful, non-ambiguous requests from AdvBench as our dataset, and use 800 manually labeled GCG attack responses on Llama-2, 3, 3.1, and Gemma-2 as ground truth (details in Appendix C, nuanced labeling refers to Vidgen et al. [2024], labeled data released in our codebase). |
| Dataset Splits | No | The paper states: "We use the 50 malicious requests curated from AdvBench (Appendix C)", and mentions running attacks for a certain number of steps with a batch size. It does not explicitly define traditional training/test/validation splits for a model, as the 'dataset' here refers to the set of prompts used for evaluation of jailbreak attacks. |
| Hardware Specification | Yes | Generating each prefix takes about 5 minutes on an A100 80G GPU (3 minutes with rougher PASR estimation). |
| Software Dependencies | No | The paper mentions specific LLMs (Llama-2-7B-chat-hf, Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, Gemma-2-9B-it, Llama3.1-70B) and jailbreak attack methods (GCG, AutoDAN), but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For each run, we select the attack prompt yielding the lowest objective loss. We test four victim LLMs: Llama-2-7B-chat-hf [Touvron et al., 2023], Llama-3-8B-Instruct, Llama-3.1-8B-Instruct [Dubey et al., 2024], and Gemma-2-9B-it [Team et al., 2024]. We use the 50 malicious requests curated from AdvBench (Appendix C), and run both attacks for 1000 steps with a batch size of 512. We combine the two selection criteria with a fixed weight of 20 for log-prefilling-ASR. We select four prefixes for our multiple-prefix objective. We generate victim LLM responses using greedy decoding and allow output to 512 tokens for nuanced evaluation. |
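
The Experiment Setup entry can be read as a small configuration plus two selection rules. The following is a minimal sketch of that reading in Python, not the authors' implementation. All names (`AttackSetup`, `select_attack_prompt`, `combined_prefix_score`, `select_prefixes`) are hypothetical. The numeric values come from the entry above; the form of the second selection criterion and the sign convention (lower score is better) are assumptions, since the entry only states that log-prefilling-ASR receives a fixed weight of 20.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackSetup:
    """Values quoted in the Experiment Setup entry; field names are hypothetical."""
    num_requests: int = 50        # malicious requests curated from AdvBench
    attack_steps: int = 1000      # optimization steps per attack run
    batch_size: int = 512
    log_pasr_weight: float = 20.0  # fixed weight for log-prefilling-ASR
    num_prefixes: int = 4          # prefixes in the multiple-prefix objective
    max_new_tokens: int = 512      # response length for nuanced evaluation
    greedy_decoding: bool = True


def select_attack_prompt(candidates):
    """Return the prompt with the lowest objective loss.

    `candidates` is a non-empty list of (prompt, loss) pairs from one run.
    """
    if not candidates:
        raise ValueError("no candidate prompts to select from")
    prompt, _ = min(candidates, key=lambda pair: pair[1])
    return prompt


def combined_prefix_score(prefill_asr, second_criterion, weight=20.0, eps=1e-8):
    """Combine two prefix selection criteria into one score (lower is better).

    ASSUMPTION: `second_criterion` is a cost-like quantity where smaller is
    better; the source does not name it. Higher prefilling ASR lowers the
    score through the weighted log term.
    """
    return second_criterion - weight * math.log(max(prefill_asr, eps))


def select_prefixes(scored_prefixes, k=4):
    """Keep the k prefixes with the lowest combined score."""
    ranked = sorted(scored_prefixes, key=lambda pair: pair[1])
    return [prefix for prefix, _ in ranked[:k]]


if __name__ == "__main__":
    setup = AttackSetup()
    run = [("prompt A", 2.0), ("prompt B", 0.5), ("prompt C", 1.0)]
    print(select_attack_prompt(run))
    scored = [
        (name, combined_prefix_score(asr, cost, setup.log_pasr_weight))
        for name, asr, cost in [("p1", 0.9, 3.0), ("p2", 0.2, 1.0), ("p3", 0.8, 2.0)]
    ]
    print(select_prefixes(scored, k=2))
```

Keeping the setup in a frozen dataclass makes the reported hyperparameters explicit and hard to change by accident when rerunning a comparison.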