Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Authors: Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Y Zou, Bo Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46% compared to existing approaches. AutoRedTeamer also matches the diversity of human-curated benchmarks in generating test cases, providing a comprehensive, scalable, and continuously evolving framework for evaluating the security of AI systems.
Researcher Affiliation | Collaboration | Andy Zhou (University of Illinois Urbana-Champaign); Kevin Wu (Stanford University); Francesco Pinto (University of Chicago); Zhaorun Chen (University of Chicago); Yi Zeng (Virtue AI); Yu Yang (Virtue AI); Shuang Yang (Meta AI); Sanmi Koyejo (Virtue AI); James Zou (Stanford University); Bo Li (Virtue AI)
Pseudocode | Yes | Complete technical details, pseudocode, attack implementations and prompts are in Sections C, H, E, and G of the Appendix.
Open Source Code | No | Answer: [No]. Justification: Code currently withheld due to security reasons.
Open Datasets | Yes | We evaluate on 240 seed prompts from HarmBench (Mazeika et al., 2024) focusing on standard and contextual behaviors, following prior work (Zou et al., 2024).
Dataset Splits | No | The paper mentions evaluating on "240 seed prompts from HarmBench" and performing "Initial validation occurs through VALIDATEATTACK on a subset of HarmBench." However, it does not provide specific details on how these subsets or splits were created, their sizes, or explicit training/validation/test partitions for the experimental evaluation of their system.
Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments (e.g., GPU models, CPU types, memory details). While it mentions using specific LLMs like Mixtral-8x22B-Instruct-v0.1 and Claude-3.5-Sonnet, these are models, not the underlying hardware on which they were executed.
Software Dependencies | No | The paper mentions implementing attacks as "Python class[es]" and using specific LLM models (Mixtral-8x22B-Instruct-v0.1, Claude-3.5-Sonnet) but does not provide specific version numbers for Python, any libraries, or other ancillary software components required to replicate the experiments.
Experiment Setup | Yes | We evaluate AutoRedTeamer in two complementary settings that demonstrate distinct advantages: (1) enhancing jailbreaking effectiveness for specific test prompts, and (2) automating comprehensive risk assessment from high-level categories. We use Mixtral-8x22B-Instruct-v0.1 (Team, 2024) for each module, except for attack implementation where we use Claude-3.5-Sonnet (Anthropic, 2024). In the first setting, we evaluate on 240 seed prompts from HarmBench (Mazeika et al., 2024) focusing on standard and contextual behaviors, following prior work (Zou et al., 2024). Here, the primary goal is maximizing attack success rate through effective attack combinations. We evaluate AutoRedTeamer on four target models: GPT-4o (OpenAI, 2024), Llama-3.1-70b (Dubey et al., 2024), Mixtral-8x7b (Team, 2024), and Claude-3.5-Sonnet (Anthropic, 2024). For standardized comparison to baselines, we omit the Seed Prompt Generator and directly refine HarmBench prompts, using GPT-4o with the HarmBench evaluation prompt (Li et al., 2024b). We initialize the attack library with four human-based attacks as a starting point to ensure diversity: (1) PAIR (Chao et al., 2023) which uses an LLM to refine the prompt, (2) ArtPrompt (Jiang et al., 2024a) which adds an ASCII-based encoding, (3) Human Jailbreaks (Wei et al., 2023a), various human-written jailbreaks, and (4) the Universal Pliny Prompt (the Prompter, 2024), a more effective jailbreak written by an expert.
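
For readers trying to reproduce the first evaluation setting, the reported configuration can be written down as data, and the headline metric (attack success rate) reduces to a simple fraction. The sketch below is illustrative only: the class name, field names, and the attack_success_rate helper are hypothetical and do not come from the paper's (withheld) code, and the judgments in the usage note are made-up placeholders rather than reported results.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""
    module_model: str = "Mixtral-8x22B-Instruct-v0.1"       # used for each module
    attack_implementation_model: str = "Claude-3.5-Sonnet"  # used to implement attacks
    judge_model: str = "GPT-4o"                             # run with the HarmBench evaluation prompt
    num_seed_prompts: int = 240                             # HarmBench seed prompts
    target_models: Tuple[str, ...] = (
        "GPT-4o", "Llama-3.1-70b", "Mixtral-8x7b", "Claude-3.5-Sonnet",
    )
    initial_attacks: Tuple[str, ...] = (
        "PAIR", "ArtPrompt", "Human Jailbreaks", "Universal Pliny Prompt",
    )


def attack_success_rate(judgments: Iterable[bool]) -> float:
    """Return the fraction of test cases a judge marked as successful attacks."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(bool(j) for j in judgments) / len(judgments)
```

As a usage illustration with placeholder data, attack_success_rate([True] * 60 + [False] * 180) evaluates to 0.25 for a batch of 240 prompts; in an actual replication the booleans would come from the judge model's verdict on each target-model response.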