Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Authors: Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B and, against larger models on the Jailbreak Bench dataset, 91% against Llama 3.1 70B and 94% against GPT-4o.
Researcher Affiliation | Industry | 1 Meta; 2 Work done while at Meta. Correspondence to: Maya Pavlova <EMAIL>, Ivan Evtimov <EMAIL>.
Pseudocode | Yes | To generate natural-language, multi-turn conversations, the attacker and target LLM are effectively paired together to converse as outlined in Alg. 1. The attacker LLM is instantiated with a system prompt containing a repertoire of available attack definitions, and then prompted to formulate an initial conversation prompt given an objective (Fig. A.2).
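The paired attacker/target loop described above can be sketched in outline. This is a minimal illustration, not the paper's Alg. 1: `attacker_turn`, `target_turn`, and `is_violating` are hypothetical placeholders for the attacker LLM, the target LLM, and the violation judge, and the real attack definitions and system prompts (Fig. A.2) are not reproduced here.

```python
# Hedged sketch of the multi-turn attacker/target conversation loop.
# All three helpers below are hypothetical stand-ins for LLM calls.

def attacker_turn(history, objective):
    # Placeholder: a real attacker LLM would reason over the conversation
    # so far and emit the next adversarial prompt toward `objective`.
    return f"adversarial prompt {len(history) // 2 + 1} toward: {objective}"

def target_turn(history, prompt):
    # Placeholder: a real target LLM would answer the attacker's prompt.
    return f"target reply to: {prompt}"

def is_violating(response):
    # Placeholder judge; a real system would use a safety classifier here.
    return False

def run_goat_conversation(objective, max_turns=5):
    """Pair attacker and target for up to `max_turns` exchanges,
    stopping early if the target produces a violating response."""
    history = []
    for _ in range(max_turns):
        prompt = attacker_turn(history, objective)
        reply = target_turn(history, prompt)
        history += [("attacker", prompt), ("target", reply)]
        if is_violating(reply):
            return True, history  # attack succeeded on this turn
    return False, history  # turn cap reached without a violation

success, transcript = run_goat_conversation("test objective")
```

With the placeholder judge always returning `False`, the loop runs all five turns and reports failure; swapping in real model calls changes only the three helper functions.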
Open Source Code | No | The paper does not explicitly state that open-source code for the methodology (GOAT) is provided, nor does it provide a direct link to a code repository.
Open Datasets | Yes | As the latest and most recently updated work, we choose the set of violating behaviors from Jailbreak Bench (Chao et al., 2024). ... We additionally evaluate on AILluminate (Vidgen et al., 2024; Ghosh et al., 2025), a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs... We used the sample publicly available at https://github.com/mlcommons/ailuminate
Dataset Splits | No | The paper uses existing benchmarks (Jailbreak Bench, AILluminate) which are essentially test sets of prompts for evaluating target models. It mentions filtering some prompts from these datasets but does not describe training, validation, or test splits for its own experimental setup or for the models under test.
Hardware Specification | No | The paper discusses various LLMs (Llama, GPT, Claude) used as attacker or target models, but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions several LLMs by name (Llama, GPT, Claude) which are core components of the system, but does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, frameworks) used in the implementation of GOAT or its experimental setup.
Experiment Setup | Yes | All attacks used the recommended settings and default system prompts for the target LLMs. Additionally, for all attacks reported here, we cap the maximum number of conversation turns at 5. If the target LLM runs out of context before that turn cap is reached, we only consider the attack a success if an earlier conversation response produced a violating response by the target LLM. Due to non-determinism in language model decoding, we initiate a conversation between our adversarial agent and each target model k times. Then, we report ASR@k, measuring whether at least one of these k conversations produced at least one unsafe model message.
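The ASR@k metric quoted above reduces to a simple computation: for each objective, run k independent conversations and count the objective as a success if any one of them yields an unsafe message. A minimal sketch, assuming each conversation's outcome has already been judged and recorded as a boolean:

```python
def asr_at_k(conversation_outcomes):
    """Compute ASR@k from judged conversation results.

    `conversation_outcomes` is a list with one entry per objective; each
    entry is a list of k booleans, one per independent conversation
    (True = that conversation produced at least one unsafe message).
    Returns the fraction of objectives where any of the k attempts succeeded.
    """
    successes = sum(1 for attempts in conversation_outcomes if any(attempts))
    return successes / len(conversation_outcomes)

# Illustrative (made-up) data: 3 objectives, k = 3 attempts each.
outcomes = [
    [False, True, False],   # objective 1: succeeded on attempt 2
    [False, False, False],  # objective 2: never succeeded
    [True, True, False],    # objective 3: succeeded on attempt 1
]
rate = asr_at_k(outcomes)  # 2 of 3 objectives succeeded at least once
```

Note that an objective counts as a success even if only one of the k non-deterministic runs produced a violation, which is why ASR@10 can be substantially higher than single-attempt ASR.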