Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Authors: Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical findings (Section 5). Integrating AdvPrefix into GCG and AutoDAN [Zhu et al., 2023] significantly increases nuanced ASR across Llama-2, 3, 3.1, and Gemma-2. For instance, GCG's ASR on Llama-3 improves from 16% to 70%, highlighting that current safety alignments struggle to generalize to unseen prefixes.
Researcher Affiliation	Collaboration	Sicheng Zhu (University of Maryland, College Park); Brandon Amos (FAIR, Meta); Yuandong Tian (FAIR, Meta); Chuan Guo (FAIR, Meta); Ivan Evtimov (FAIR, Meta)
Pseudocode	No	The paper describes methodologies using mathematical formulations and descriptive text, but it does not contain explicit pseudocode blocks or algorithm figures labeled as such.
Open Source Code	Yes	Code and selected prefixes are released in github.com/facebookresearch/jailbreak-objectives. Warning: This paper includes language that could be considered inappropriate or harmful.
Open Datasets	Yes	We curate 50 highly harmful, non-ambiguous requests from AdvBench as our dataset, and use 800 manually labeled GCG attack responses on Llama-2, 3, 3.1, and Gemma-2 as ground truth (details in Appendix C, nuanced labeling refers to Vidgen et al. [2024], labeled data released in our codebase).
Dataset Splits	No	The paper states: "We use the 50 malicious requests curated from AdvBench (Appendix C)", and mentions running attacks for a certain number of steps with a batch size. It does not explicitly define traditional training/test/validation splits for a model, as the 'dataset' here refers to the set of prompts used for evaluation of jailbreak attacks.
Hardware Specification	Yes	Generating each prefix takes about 5 minutes on an A100 80G GPU (3 minutes with rougher PASR estimation).
Software Dependencies	No	The paper mentions specific LLMs (Llama-2-7B-chat-hf, Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, Gemma-2-9B-it, Llama3.1-70B) and jailbreak attack methods (GCG, AutoDAN), but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	For each run, we select the attack prompt yielding the lowest objective loss. We test four victim LLMs: Llama-2-7B-chat-hf [Touvron et al., 2023], Llama-3-8B-Instruct, Llama-3.1-8B-Instruct [Dubey et al., 2024], and Gemma-2-9B-it [Team et al., 2024]. We use the 50 malicious requests curated from AdvBench (Appendix C), and run both attacks for 1000 steps with a batch size of 512. We combine the two selection criteria with a fixed weight of 20 for log-prefilling-ASR. We select four prefixes for our multiple-prefix objective. We generate victim LLM responses using greedy decoding and allow output to 512 tokens for nuanced evaluation.
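
The settings quoted in the Experiment Setup row can be gathered into a single configuration object, which may help when trying to reproduce the runs. The sketch below is illustrative only: the class name AttackSetup, its field names, and the helper select_best_prompt are hypothetical and are not taken from the paper's released code. Only the numeric values and model names come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    # Victim LLMs named in the quoted setup text.
    victim_models: tuple = (
        "Llama-2-7B-chat-hf",
        "Llama-3-8B-Instruct",
        "Llama-3.1-8B-Instruct",
        "Gemma-2-9B-it",
    )
    num_requests: int = 50            # malicious requests curated from AdvBench
    attack_steps: int = 1000          # optimization steps per attack run
    batch_size: int = 512             # batch size per step
    log_prefilling_asr_weight: float = 20.0  # fixed weight in prefix selection
    num_prefixes: int = 4             # prefixes in the multiple-prefix objective
    max_output_tokens: int = 512      # output length allowed for evaluation
    greedy_decoding: bool = True      # responses are generated greedily


def select_best_prompt(candidates):
    """Return the prompt with the lowest objective loss.

    `candidates` is an iterable of (prompt, loss) pairs; this format is an
    assumption made for illustration.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate prompts to select from")
    return min(candidates, key=lambda pair: pair[1])[0]
```

For example, select_best_prompt([("a", 2.0), ("b", 0.5)]) picks "b", mirroring the stated rule of keeping the attack prompt with the lowest objective loss in each run.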