Jailbroken: How Does LLM Safety Training Fail?
Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then conduct an empirical evaluation of state-of-the-art safety-trained models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly constructed jailbreak attacks. We evaluate on both a curated dataset of harmful prompts from these models' red-teaming evaluation sets and a larger synthetic dataset of harmful prompts for broader coverage. |
| Researcher Affiliation | Academia | Alexander Wei (UC Berkeley, awei@berkeley.edu); Nika Haghtalab (UC Berkeley, nika@berkeley.edu); Jacob Steinhardt (UC Berkeley, jsteinhardt@berkeley.edu) |
| Pseudocode | No | The paper describes attack designs and evaluation procedures in natural language and figures, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Our code and data are available to researchers upon request. |
| Open Datasets | Yes | The curated dataset consists of all 16 examples of harmful prompts used to evaluate GPT-4 from its report [38] and 16 harmful prompts adapted from the red-teaming dataset of Ganguli et al. [23] to ensure coverage of each of their 17 harmful prompt tags. |
| Dataset Splits | Yes | To assess the vulnerability of safety-trained LLMs to jailbreaks, we test state-of-the-art models from OpenAI and Anthropic on two datasets of harmful requests: a curated set of 32 prompts adapted from red-teaming efforts of OpenAI and Anthropic [38, 7] and a held-out set of 317 prompts (that was not seen by the authors until after data collection). |
| Hardware Specification | No | The paper states that interactions with the models were done through APIs, so the authors did not run the models on their own hardware and no hardware specifications are provided for the experiments. |
| Software Dependencies | No | The paper mentions interacting with models through APIs but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used for running the experiments. |
| Experiment Setup | Yes | To minimize the impact of noise from decoding, we sample with temperature 0. ... For the GPT models, which have a system prompt, we use the suggested system prompt "You are a helpful assistant." [39]. (See the query sketch below the table.) |
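
The paper's own code is available only upon request, so the following is a minimal sketch of the evaluation query setup as described in the Experiment Setup row: temperature 0 decoding and the suggested system prompt. It assumes the OpenAI Python client; the model name and the `query_gpt4` helper are illustrative, not taken from the paper.

```python
# Minimal sketch of the described query setup (not the authors' released code).
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def query_gpt4(prompt: str) -> str:
    """Send a single prompt with temperature 0 and the suggested system prompt."""
    response = client.chat.completions.create(
        model="gpt-4",          # illustrative model name
        temperature=0,          # deterministic decoding, as in the paper's setup
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```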