Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We ran a suite of eight baseline models against our benchmarks in order to measure how effective existing LLMs are at rebuffing attacks. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. |
| Researcher Affiliation | Academia | ¹ UC Berkeley, ² Georgia Tech, ³ Harvard University |
| Pseudocode | No | The paper describes its methods in prose (including its use of regular expressions), but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release data and code at tensortrust.ai/paper |
| Open Datasets | Yes | We release a full dump of attacks and defenses provided by Tensor Trust players (minus a small number that violated our ToS). The structure of this dataset is illustrated in Fig. 3. We release data and code at tensortrust.ai/paper |
| Dataset Splits | Yes | The benchmarks and all analysis in this paper are derived from only the first 127,000 attacks and 46,000 defenses, which were all evaluated against GPT 3.5 Turbo... After randomly pairing attacks with good defenses in order to build an evaluation dataset, we adversarially filter to include only those attack/defense combinations which succeeded in extracting the defense’s access code from at least two of the three reference LLMs. (A code sketch of this filtering step follows the table.) |
| Hardware Specification | No | The paper lists the LLMs used (e.g., GPT 3.5 Turbo, Claude Instant 1.2, PaLM Chat Bison 001) and how some were accessed (e.g., PaLM 2 via the Vertex AI SDK for Python), but it does not specify the hardware specifications (e.g., GPU models, CPU models, memory) of the machines used to conduct the experiments and evaluations. |
| Software Dependencies | Yes | Our game uses OpenAI’s GPT 3.5 Turbo (06/13 version), Anthropic’s Claude Instant 1.2, and Google’s PaLM Chat Bison 001. The models are GPT 3.5 Turbo (Brown et al., 2020); GPT-4 (OpenAI, 2023); Claude-instant-v1.2 (Anthropic, 2023a; Bai et al., 2022); Claude-2.0 (Anthropic, 2023c;b); PaLM 2 (Anil et al., 2023); LLaMA 2 Chat in 7B, 13B and 70B variants (Touvron et al., 2023); and Code LLaMA-34B-instruct (Rozière et al., 2023). |
| Experiment Setup | Yes | During sampling, we set temperature=0 to reduce randomness and limited the length of opening defenses (300 tokens), access codes (150 tokens), closing defenses (200 tokens), attacks (500 tokens), and LLM responses (500 tokens). In GPT 3.5 Turbo, each message must be assigned a role of either system or user. In Tensor Trust, we marked the opening defense as a system message, the attack as a user message, and the closing defense as a user message. (See the message-construction sketch below the table.) |
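
The adversarial filtering step quoted in the Dataset Splits row is straightforward to sketch in code. The snippet below is a minimal illustration, not the authors' released implementation: the `query_model` callable, the `Defense` structure, and the verbatim-substring success test are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# The three reference LLMs named in the paper (identifiers assumed).
REFERENCE_MODELS = ["gpt-3.5-turbo-0613", "claude-instant-1.2", "chat-bison-001"]

@dataclass
class Defense:
    opening_defense: str
    access_code: str
    closing_defense: str

def leaks_access_code(response: str, access_code: str) -> bool:
    """Assumed success test: the defense's access code appears verbatim
    (case-insensitively) in the model's response."""
    return access_code.lower() in response.lower()

def adversarially_filter(
    pairs: Iterable[tuple[Defense, str]],
    query_model: Callable[[str, Defense, str], str],
    min_successes: int = 2,
) -> list[tuple[Defense, str]]:
    """Keep only the randomly paired attack/defense combinations whose
    attack extracts the access code from at least `min_successes` of
    the three reference models, as described in the paper."""
    kept = []
    for defense, attack in pairs:
        successes = sum(
            leaks_access_code(query_model(model, defense, attack), defense.access_code)
            for model in REFERENCE_MODELS
        )
        if successes >= min_successes:
            kept.append((defense, attack))
    return kept
```

Requiring success against at least two of the three reference models discards attacks that only exploit quirks of a single LLM, which makes the resulting benchmark more model-agnostic.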
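
The role assignment quoted in the Experiment Setup row can likewise be illustrated with a short snippet. This is a minimal sketch using the OpenAI Python client; the client usage and the exact model identifier are assumptions, as the paper does not publish this code inline.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_defended_model(opening_defense: str, attack: str, closing_defense: str) -> str:
    """Reproduce the role assignment described in the paper: the opening
    defense is a system message, while the attack and the closing defense
    are both user messages. Sampling uses temperature=0 and caps LLM
    responses at 500 tokens, matching the paper's setup."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",  # assumed identifier for the 06/13 version
        temperature=0,               # reduce randomness, as in the paper
        max_tokens=500,              # LLM responses limited to 500 tokens
        messages=[
            {"role": "system", "content": opening_defense},
            {"role": "user", "content": attack},
            {"role": "user", "content": closing_defense},
        ],
    )
    return response.choices[0].message.content
```

The input-side limits quoted above (300 tokens for opening defenses, 150 for access codes, 200 for closing defenses, 500 for attacks) would be enforced before these strings reach this call.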