Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Authors: Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B and, against larger models on the Jailbreak Bench dataset, 91% against Llama 3.1 70B and 94% against GPT-4o.
Researcher Affiliation | Industry | 1 Meta; 2 Work done while at Meta. Correspondence to: Maya Pavlova <EMAIL>, Ivan Evtimov <EMAIL>.
Pseudocode | Yes | To generate natural-language, multi-turn conversations, the attacker and target LLM are effectively paired together to converse as outlined in Alg. 1. The attacker LLM is instantiated with a system prompt containing a repertoire of available attack definitions, and then prompted to formulate an initial conversation prompt given an objective (Fig. A.2).
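The paired attacker/target loop described above can be sketched in outline. This is a minimal illustration, not the paper's Alg. 1: `attacker_turn`, `target_turn`, and `is_violating` are hypothetical placeholders for the attacker LLM, the target LLM, and the violation judge, and the real attack definitions and system prompts (Fig. A.2) are not reproduced here.

```python
# Hedged sketch of the multi-turn attacker/target conversation loop.
# All three helpers below are hypothetical stand-ins for LLM calls.

def attacker_turn(history, objective):
    # Placeholder: a real attacker LLM would reason over the conversation
    # so far and emit the next adversarial prompt toward `objective`.
    return f"adversarial prompt {len(history) // 2 + 1} toward: {objective}"

def target_turn(history, prompt):
    # Placeholder: a real target LLM would answer the attacker's prompt.
    return f"target reply to: {prompt}"

def is_violating(response):
    # Placeholder judge; a real system would use a safety classifier here.
    return False

def run_goat_conversation(objective, max_turns=5):
    """Pair attacker and target for up to `max_turns` exchanges,
    stopping early if the target produces a violating response."""
    history = []
    for _ in range(max_turns):
        prompt = attacker_turn(history, objective)
        reply = target_turn(history, prompt)
        history += [("attacker", prompt), ("target", reply)]
        if is_violating(reply):
            return True, history  # attack succeeded on this turn
    return False, history  # turn cap reached without a violation

success, transcript = run_goat_conversation("test objective")
```

With the placeholder judge always returning `False`, the loop runs all five turns and reports failure; swapping in real model calls changes only the three helper functions.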
Open Source Code | No | The paper does not explicitly state that open-source code for the methodology (GOAT) is provided, nor does it provide a direct link to a code repository.
Open Datasets | Yes | As the latest and most recently updated work, we choose the set of violating behaviors from Jailbreak Bench (Chao et al., 2024). ... We additionally evaluate on AILluminate (Vidgen et al., 2024; Ghosh et al., 2025), a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs... We used the sample publicly available at https://github.com/mlcommons/ailuminate
Dataset Splits | No | The paper uses existing benchmarks (Jailbreak Bench, AILluminate) which are essentially test sets of prompts for evaluating target models. It mentions filtering some prompts from these datasets but does not describe training, validation, or test splits for its own experimental setup or for the models under test.
Hardware Specification | No | The paper discusses various LLMs (Llama, GPT, Claude) used as attacker or target models, but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions several LLMs by name (Llama, GPT, Claude) which are core components of the system, but does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, frameworks) used in the implementation of GOAT or its experimental setup.
Experiment Setup | Yes | All attacks used the recommended settings and default system prompts for the target LLMs. Additionally, for all attacks reported here, we cap the maximum number of conversation turns at 5. If the target LLM runs out of context before that turn cap is reached, we only consider the attack a success if an earlier conversation response produced a violating response by the target LLM. Due to non-determinism in language model decoding, we initiate a conversation between our adversarial agent and each target model k times. Then, we report ASR@k, measuring whether at least one of these k conversations produced at least one unsafe model message.
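The ASR@k metric quoted above reduces to a simple computation: for each objective, run k independent conversations and count the objective as a success if any one of them yields an unsafe message. A minimal sketch, assuming each conversation's outcome has already been judged and recorded as a boolean:

```python
def asr_at_k(conversation_outcomes):
    """Compute ASR@k from judged conversation results.

    `conversation_outcomes` is a list with one entry per objective; each
    entry is a list of k booleans, one per independent conversation
    (True = that conversation produced at least one unsafe message).
    Returns the fraction of objectives where any of the k attempts succeeded.
    """
    successes = sum(1 for attempts in conversation_outcomes if any(attempts))
    return successes / len(conversation_outcomes)

# Illustrative (made-up) data: 3 objectives, k = 3 attempts each.
outcomes = [
    [False, True, False],   # objective 1: succeeded on attempt 2
    [False, False, False],  # objective 2: never succeeded
    [True, True, False],    # objective 3: succeeded on attempt 1
]
rate = asr_at_k(outcomes)  # 2 of 3 objectives succeeded at least once
```

Note that an objective counts as a success even if only one of the k non-deterministic runs produced a violation, which is why ASR@10 can be substantially higher than single-attempt ASR.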