Jailbroken: How Does LLM Safety Training Fail?
Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then conduct an empirical evaluation of state-of-the-art safety-trained models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly constructed jailbreak attacks. We evaluate on both a curated dataset of harmful prompts from these models' red-teaming evaluation sets and a larger synthetic dataset of harmful prompts for broader coverage. |
| Researcher Affiliation | Academia | Alexander Wei (UC Berkeley, awei@berkeley.edu); Nika Haghtalab (UC Berkeley, nika@berkeley.edu); Jacob Steinhardt (UC Berkeley, jsteinhardt@berkeley.edu) |
| Pseudocode | No | The paper describes attack designs and evaluation procedures in natural language and figures, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Our code and data are available to researchers upon request. |
| Open Datasets | Yes | The curated dataset consists of all 16 examples of harmful prompts used to evaluate GPT-4 from its report [38] and 16 harmful prompts adapted from the red-teaming dataset of Ganguli et al. [23] to ensure coverage of each of their 17 harmful prompt tags. |
| Dataset Splits | Yes | To assess the vulnerability of safety-trained LLMs to jailbreaks, we test state-of-the-art models from OpenAI and Anthropic on two datasets of harmful requests: a curated set of 32 prompts adapted from red-teaming efforts of OpenAI and Anthropic [38, 7] and a held-out set of 317 prompts (that was not seen by the authors until after data collection). |
| Hardware Specification | No | The paper states that interactions with the models were done through APIs, so the authors did not run the models on their own hardware and no hardware specifications are provided for the experiments. |
| Software Dependencies | No | The paper mentions interacting with models through APIs but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used for running the experiments. |
| Experiment Setup | Yes | To minimize the impact of noise from decoding, we sample with temperature 0. ... For the GPT models, which have a system prompt, we use the suggested system prompt "You are a helpful assistant." [39]. (See the query sketch below the table.) |
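
The paper's own code is available only upon request, so the following is a minimal sketch of the evaluation query setup as described in the Experiment Setup row: temperature 0 decoding and the suggested system prompt. It assumes the OpenAI Python client; the model name and the `query_gpt4` helper are illustrative, not taken from the paper.

```python
# Minimal sketch of the described query setup (not the authors' released code).
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def query_gpt4(prompt: str) -> str:
    """Send a single prompt with temperature 0 and the suggested system prompt."""
    response = client.chat.completions.create(
        model="gpt-4",          # illustrative model name
        temperature=0,          # deterministic decoding, as in the paper's setup
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```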