Many-shot Jailbreaking

Authors: Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, James Sully, Alex Tamkin, Tamera Lanham, Karina Nguyen, Tomek Korbak, Jared Kaplan, Deep Ganguli, Samuel Bowman, Ethan Perez, Roger B. Grosse, David K. Duvenaud

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. ... We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks.
Researcher Affiliation Collaboration Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J. Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, James Sully, Alex Tamkin, Tamera Lanham, Karina Nguyen, Tomasz Korbak, Jared Kaplan, Deep Ganguli, Samuel R. Bowman, Ethan Perez, Roger Baker Grosse, David Duvenaud. Correspondence to: cem@anthropic.com
Pseudocode Yes Listing 1: The algorithm used to compute the negative log probabilities reported in our experiments. ... Listing 2: The algorithm used to compute the sample prompts used to compute rates of harmful responses.
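The paper's Listing 1 is not reproduced in this review. As a hedged illustration only, the toy sketch below shows the general shape of such a computation: assemble an n-shot prompt from demonstrations of the target behavior, then sum per-token log-probabilities (as would be returned by a model API) into a negative log probability. The function names, prompt template, and the assumption that per-token log-probs are already available are all illustrative, not taken from the paper's actual code.

```python
def nshot_prompt(demos, final_question):
    """Build an n-shot prompt: n demonstration dialogues followed by the
    final question. The "User:/Assistant:" template is a made-up example,
    not the formatting used in the paper."""
    shots = "".join(f"User: {q}\nAssistant: {a}\n\n" for q, a in demos)
    return shots + f"User: {final_question}\nAssistant:"

def negative_log_prob(token_logprobs):
    """Negative log probability of a target continuation, given the
    per-token log-probs a model API would report for that continuation."""
    return -sum(token_logprobs)
```

With per-token log-probs in hand, the NLL of a fixed harmful response can then be tracked as the number of in-context demonstrations grows.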
Open Source Code No Justification: The experiments in the paper are run on proprietary code, data and models.
Open Datasets Yes Our malevolent personality evaluations are based on the dataset accompanying the work of Perez et al. (2022b), which can be found at the following URL: https://github.com/anthropics/evals/tree/main/persona.
Dataset Splits No The paper discusses 'Training distribution' and 'Test distribution' but does not explicitly specify a separate 'validation' split or provide exact percentages/counts for all splits.
Hardware Specification No Justification: The experiments in the paper are run on proprietary code, data and models.
Software Dependencies No The paper mentions using a "helpful-only model" and Claude 2.0 but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup Yes We did temperature sampling with temperature=1.0. ... We use all the same hyperparameters as in that paper, including using a batch size of 512 prospective new tokens at each step drawn from the top 256 potential token swaps at each position, and optimize over 500 GCG steps.
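The quoted setup gives the GCG hyperparameters (a batch of 512 candidate prompts drawn from the top 256 token swaps per position, optimized over 500 steps). As a rough, hedged sketch of that loop's structure only: the toy version below substitutes random single-token swaps and a caller-supplied loss for the gradient-ranked substitutions and model loss of the real attack, so it illustrates the greedy candidate-batch-then-select pattern rather than the paper's implementation.

```python
import random

def gcg_step(suffix, vocab, loss_fn, top_k=256, batch_size=512, rng=random):
    """One GCG-style step (toy): propose a batch of single-token swaps and
    keep the candidate with the lowest loss. In the real attack, candidates
    are drawn from the top_k gradient-ranked token substitutions per
    position; here they are sampled at random from the first top_k tokens."""
    candidates = []
    for _ in range(batch_size):
        pos = rng.randrange(len(suffix))
        tok = rng.choice(vocab[:top_k])
        candidates.append(suffix[:pos] + [tok] + suffix[pos + 1:])
    return min(candidates, key=loss_fn)

def optimize(suffix, vocab, loss_fn, steps=500, **step_kwargs):
    """Run the greedy loop, accepting a candidate only if it does not
    increase the loss."""
    for _ in range(steps):
        cand = gcg_step(suffix, vocab, loss_fn, **step_kwargs)
        if loss_fn(cand) <= loss_fn(suffix):
            suffix = cand
    return suffix
```

In the paper's setting, `loss_fn` would be the model's loss on a target completion and the candidate set would come from token-level gradients, per Zou et al.'s GCG; this skeleton only mirrors the batch/top-k/steps hyperparameters named in the quote.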