Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Best-of-N Jailbreaking

Authors: John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Arushi Somani, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations such as random shuffling or capitalization for textual prompts until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers and reasoning models like o1. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude.
Researcher Affiliation | Collaboration | John Hughes (Anthropic); Sara Price (Anthropic); Aengus Lynch (UCL); Rylan Schaeffer (Stanford University); Fazl Barez (University of Oxford); Arushi Somani (Anthropic); Sanmi Koyejo (Stanford University); Henry Sleight (Constellation); Erik Jones (Anthropic); Ethan Perez+ (Anthropic); Mrinank Sharma+ (Anthropic)
Pseudocode | Yes | Algorithm 1 PrePAIR. Require: Batch of requests R = {r_1, r_2, ..., r_n}, initial prefix p_0, target model M_T, classifier model M_C, red-teaming model M_R. for i in {1, .., max steps} do
Open Source Code | Yes | See our website https://jplhughes.github.io/bon-jailbreaking
Open Datasets | Yes | We use 159 direct requests from the standard HarmBench test dataset (Mazeika et al., 2024) that exclude copyright and contextual behaviors. We use the following noise, music, and speech files contained in the MUSAN (Snyder et al., 2015) data zip file for all BoN jailbreaking runs. We first select harmful requests from AdvBench (Chen et al., 2022) that have no overlap with HarmBench.
Dataset Splits | Yes | We use 159 direct requests from the standard HarmBench test dataset (Mazeika et al., 2024) that exclude copyright and contextual behaviors. Train set contains 50 PAIR, 50 TAP, and 75 direct requests. It is used for optimizing a universal jailbreak across as many requests as possible. Test set contains the same number as the train set and is used to understand how universal attacks transfer to new requests.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, memory amounts) for running its experiments. While a cost analysis is provided in Appendix E.3, it focuses on the cost of API calls rather than detailed hardware specifications for the experimental setup.
Software Dependencies | No | The paper mentions various software components and libraries, such as the "Linux SoX package", "wavaugment" (Kharitonov et al., 2020), Kaldi’s (Povey et al., 2011) "wavreverberate", "requests", "Beautiful Soup", "matplotlib", "cairosvg", "PIL", and "cryptography". However, it does not provide specific version numbers for these software dependencies or for the programming language used (Python), which are necessary for a reproducible description.
Experiment Setup | Yes | We use three text augmentations, namely, character scrambling, random capitalization, and character noising (Figure 2, top left; Appendix C.1) with N = 10,000 and sampling temperature = 1. We use N = 7,200 and temperature = 1. We use N = 7,200 and temperature 1. We use the attacking LLM at a temperature of 0.8 and the target LLM at a temperature of 1. We use batch size = 4.
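
The Research Type entry above reports that ASR, as a function of the number of samples N, follows power-law-like behavior. As a rough illustration of what that means in practice, the sketch below fits a curve of the form -log(ASR) = a * N^(-b) by ordinary least squares in log-log space. The functional form, the function names (fit_power_law, predict_asr), and all data points are assumptions made for this example; they are synthetic and are not taken from the paper's results or code.

```python
import math


def fit_power_law(n_samples, asr_values):
    """Fit -log(ASR) = a * N**(-b) with a least-squares line in log-log space.

    Assumed functional form, for illustration only.
    Requires every ASR value to lie strictly between 0 and 1.
    """
    # log(-log(ASR)) = log(a) - b * log(N), i.e. a straight line.
    xs = [math.log(n) for n in n_samples]
    ys = [math.log(-math.log(p)) for p in asr_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_asr(n, a, b):
    """Evaluate the fitted curve at sample count n."""
    return math.exp(-a * n ** (-b))


# Synthetic, noise-free points generated from hypothetical parameters.
true_a, true_b = 5.0, 0.3
n_grid = [10, 100, 1000, 10000]
asr_synth = [predict_asr(n, true_a, true_b) for n in n_grid]

a_hat, b_hat = fit_power_law(n_grid, asr_synth)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

With real measurements the fit would be noisy, and a curve of this kind is typically fitted on small N and then checked against held-out larger N before being used to extrapolate.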