Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We ran a suite of eight baseline models against our benchmarks in order to measure how effective existing LLMs are at rebuffing attacks. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. |
| Researcher Affiliation | Academia | ¹ UC Berkeley, ² Georgia Tech, ³ Harvard University |
| Pseudocode | No | The paper describes its methods in prose (including its use of regular expressions), but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release data and code at tensortrust.ai/paper |
| Open Datasets | Yes | We release a full dump of attacks and defenses provided by Tensor Trust players (minus a small number that violated our ToS). The structure of this dataset is illustrated in Fig. 3. We release data and code at tensortrust.ai/paper |
| Dataset Splits | Yes | The benchmarks and all analysis in this paper are derived from only the first 127,000 attacks and 46,000 defenses, which were all evaluated against GPT 3.5 Turbo... After randomly pairing attacks with good defenses in order to build an evaluation dataset, we adversarially filter to include only those attack/defense combinations which succeeded in extracting the defense’s access code from at least two of the three reference LLMs. (A code sketch of this filtering step follows the table.) |
| Hardware Specification | No | The paper lists the LLMs used (e.g., GPT 3.5 Turbo, Claude Instant 1.2, PaLM Chat Bison 001) and how some were accessed (e.g., PaLM 2 via the Vertex AI SDK for Python), but it does not specify the hardware specifications (e.g., GPU models, CPU models, memory) of the machines used to conduct the experiments and evaluations. |
| Software Dependencies | Yes | Our game uses OpenAI’s GPT 3.5 Turbo (06/13 version), Anthropic’s Claude Instant 1.2, and Google’s PaLM Chat Bison 001. The models are GPT 3.5 Turbo (Brown et al., 2020); GPT-4 (OpenAI, 2023); Claude-instant-v1.2 (Anthropic, 2023a; Bai et al., 2022); Claude-2.0 (Anthropic, 2023c;b); PaLM 2 (Anil et al., 2023); LLaMA 2 Chat in 7B, 13B and 70B variants (Touvron et al., 2023); and Code LLaMA-34B-instruct (Rozière et al., 2023). |
| Experiment Setup | Yes | During sampling, we set temperature=0 to reduce randomness and limited the length of opening defenses (300 tokens), access codes (150 tokens), closing defenses (200 tokens), attacks (500 tokens), and LLM responses (500 tokens). In GPT 3.5 Turbo, each message must be assigned a role of either system or user. In Tensor Trust, we marked the opening defense as a system message, the attack as a user message, and the closing defense as a user message. (See the message-construction sketch below the table.) |
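
The adversarial filtering step quoted in the Dataset Splits row is straightforward to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: the `query_model` callable, the `Defense` structure, and the verbatim-substring success test are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# The three reference LLMs named in the paper (identifiers assumed).
REFERENCE_MODELS = ["gpt-3.5-turbo-0613", "claude-instant-1.2", "chat-bison-001"]

@dataclass
class Defense:
    opening_defense: str
    access_code: str
    closing_defense: str

def leaks_access_code(response: str, access_code: str) -> bool:
    """Assumed success test: the defense's access code appears verbatim
    (case-insensitively) in the model's response."""
    return access_code.lower() in response.lower()

def adversarially_filter(
    pairs: Iterable[tuple[Defense, str]],
    query_model: Callable[[str, Defense, str], str],
    min_successes: int = 2,
) -> list[tuple[Defense, str]]:
    """Keep only the randomly paired attack/defense combinations whose
    attack extracts the access code from at least `min_successes` of
    the three reference models, as described in the paper."""
    kept = []
    for defense, attack in pairs:
        successes = sum(
            leaks_access_code(query_model(model, defense, attack), defense.access_code)
            for model in REFERENCE_MODELS
        )
        if successes >= min_successes:
            kept.append((defense, attack))
    return kept
```

Requiring success against at least two of the three reference models discards attacks that only exploit quirks of a single LLM, which makes the resulting benchmark more model-agnostic.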
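
The role assignment quoted in the Experiment Setup row can likewise be illustrated with a short snippet. This is a minimal sketch using the OpenAI Python client; the client usage and the exact model identifier are assumptions, as the paper does not publish this code inline.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_defended_model(opening_defense: str, attack: str, closing_defense: str) -> str:
    """Reproduce the role assignment described in the paper: the opening
    defense is a system message, while the attack and the closing defense
    are both user messages. Sampling uses temperature=0 and caps LLM
    responses at 500 tokens, matching the paper's setup."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",  # assumed identifier for the 06/13 version
        temperature=0,               # reduce randomness, as in the paper
        max_tokens=500,              # LLM responses limited to 500 tokens
        messages=[
            {"role": "system", "content": opening_defense},
            {"role": "user", "content": attack},
            {"role": "user", "content": closing_defense},
        ],
    )
    return response.choices[0].message.content
```

The input-side limits quoted above (300 tokens for opening defenses, 150 for access codes, 200 for closing defenses, 500 for attacks) would be enforced before these strings reach this call.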