Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

Authors: J Rosser, Jakob Foerster

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In this paper, we introduce AGENTBREEDER, an evolutionary open-ended framework capable of generating large populations of diverse multi-agent scaffolds. By equipping this framework with multi-objective optimization, we explore the generation of multi-agent scaffolds along complementary objectives of capability and safety. AGENTBREEDER can be used to blue team a set of scaffolds to generate offspring that exhibit greater adversarial robustness and performance on capability benchmarks.
Researcher Affiliation	Academia	J Rosser, University of Oxford; Jakob Foerster, FLAIR, University of Oxford
Pseudocode	Yes	Algorithm 1: AgentBreeder. Input: number of generations $G$; number of clusters $K$; number of evolutions $M$; capability benchmark $f_C(s)$; safety benchmark $f_S(s)$; embedding function $f_D(\cdot)$; seed scaffolds $Q_0$; clustering function $A(\cdot)$. Initialize seed population $P_0 = Q_0$ of size $N_0$. For generation $g = 1$ to $G$ do: For scaffold $s \in Q_{g-1}$ do: 1. Compute capability $f_C(s)$ and safety $f_S(s)$. 2. Compute embedding $e_s \leftarrow f_D(s)$. Cluster population into $K$ clusters: $C_1, C_2, \ldots, C_K \leftarrow A(e_1, e_2, \ldots, e_{N_g})$. Identify Pareto elites $E_g$: 1. Set $E_g \leftarrow \emptyset$. 2. For cluster $k = 1$ to $K$ do: (a) Find its Pareto front $F_k$ using $f_C$ and $f_S$. (b) Update elite cohort $E_g \leftarrow E_g \cup F_k$. Generate offspring $Q_g$: 1. Set $Q_g \leftarrow \emptyset$. 2. For evolution $m = 1$ to $M$ do: (a) Weighted sampling of 1 or 2 elites from $E_g$. (b) If 2 elites, the Meta Agent performs Crossover; otherwise Mutation. (c) Add the offspring to $Q_g$. Update population: $P_g \leftarrow P_{g-1} \cup Q_g$. Update population size: $N_g \leftarrow N_{g-1} + M$. Output: final population $P_G$.
Open Source Code	Yes	Code is available at https://github.com/jrosseruk/AgentBreeder.
Open Datasets	Yes	We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. MMLU [19] is a multi-task benchmark comprising multiple choice questions on 57 subjects across STEM, the humanities, the social sciences, and more. DROP [14] is an English reading comprehension benchmark which requires the model to reason over and answer questions on given paragraphs. GPQA [43] is a benchmark comprising graduate-level multiple-choice questions in the fields of biology, physics, and chemistry. Salad Data [28] is a hierarchical and comprehensive safety benchmark spanning 3 levels. TruthfulQA [29] is a question-answering benchmark comprising questions that some humans may answer incorrectly.
Dataset Splits	Yes	Before running our evolution on our chosen benchmark, we evaluate a single CoT agent on 1,000 samples from the validation set of the benchmark, oversampling and resampling where necessary. For each generation, we validate the newly discovered scaffolds using a balanced sampling strategy, selecting 50% positive and 50% negative samples.
Hardware Specification	No	The BLUEAGENTBREEDER experiment, comprising one 20-generation run on each of our 3 benchmarks as well as evaluations, costs approximately $600, with $500 from gpt-4o-mini-2024-07-18 and $100 from claude-3-5-sonnet-20241022-v2:0. The REDAGENTBREEDER experiment, comprising one 10-generation run on DROP, cost $115 as expected. The CAPABLEAGENTBREEDER experiment, comprising one 20-generation run on each of our 3 benchmarks as well as evaluations, costs approximately $400.
Software Dependencies	Yes	Claude 3.5 Sonnet [3] (claude-3-5-sonnet-20241022-v2:0) is used as the core model of the Meta Agent due to its state-of-the-art performance on code generation tasks [49]. In AGENTBREEDER, we use OpenAI's text-embedding-3-small [39] model returning a 12-dimensional text embedding of the system name and code as our descriptor to encode semantic information about the name, structure, and potential logic embedded in the scaffold. We take the highest-performing scaffolds from ADAS [20] and evaluate them with GPT-4o mini [37] as their core model.
Experiment Setup	Yes	To achieve a balanced trade-off between system performance and system diversity, a distance threshold of 0.7 was selected. Weighting the mutation operator twice as highly as crossover was found empirically to lead to faster convergence. We ran BLUEAGENTBREEDER for 20 generations, on each of our three chosen capability benchmarks... REDAGENTBREEDER seeks to discover Red Team scaffolds... with only half the generation budget of BLUEAGENTBREEDER. As an ablation for our multi-objective criteria and to compare AGENTBREEDER against the seminal work, we run CAPABLEAGENTBREEDER, a single-objective variant of our framework, for 20 generations, evolving 10 mutants each generation.
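
The elite-selection and operator-sampling steps quoted in the Pseudocode and Experiment Setup rows can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The Scaffold class, the function names, the toy scores, and the precomputed cluster labels are all hypothetical. The sketch also assumes that both objectives are maximized, that "weighting the mutation operator twice as highly as crossover" means a 2:1 sampling weight, and that parents are drawn uniformly from the elites.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaffold:
    """Hypothetical container for one evaluated scaffold."""
    name: str
    capability: float  # stands in for f_C(s); assumed higher is better
    safety: float      # stands in for f_S(s); assumed higher is better
    cluster: int       # cluster label, assumed to be computed elsewhere


def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.capability >= b.capability and a.safety >= b.safety
    better = a.capability > b.capability or a.safety > b.safety
    return no_worse and better


def pareto_front(scaffolds):
    """Keep the scaffolds that no other scaffold in the list dominates."""
    return [s for s in scaffolds if not any(dominates(o, s) for o in scaffolds)]


def pareto_elites(population):
    """Union of the per-cluster Pareto fronts (the elite cohort E_g)."""
    clusters = {}
    for s in population:
        clusters.setdefault(s.cluster, []).append(s)
    elites = []
    for k in sorted(clusters):
        elites.extend(pareto_front(clusters[k]))
    return elites


def choose_operator(rng, mutation_weight=2.0, crossover_weight=1.0):
    """Pick an operator, with mutation weighted twice as highly as crossover."""
    return rng.choices(
        ["mutation", "crossover"],
        weights=[mutation_weight, crossover_weight],
        k=1,
    )[0]


def sample_parents(elites, rng):
    """Return (operator, parents): two parents for crossover, one for mutation."""
    operator = choose_operator(rng)
    if operator == "crossover" and len(elites) < 2:
        operator = "mutation"  # fall back when only one elite exists
    n_parents = 2 if operator == "crossover" else 1
    return operator, rng.sample(elites, n_parents)


# Toy population with made-up scores, for illustration only.
population = [
    Scaffold("a", 0.9, 0.2, cluster=0),
    Scaffold("b", 0.5, 0.5, cluster=0),
    Scaffold("c", 0.4, 0.4, cluster=0),  # dominated by "b"
    Scaffold("d", 0.3, 0.9, cluster=1),
    Scaffold("e", 0.2, 0.8, cluster=1),  # dominated by "d"
]
elites = pareto_elites(population)
print([s.name for s in elites])  # ['a', 'b', 'd']
print(sample_parents(elites, random.Random(0)))
```

Taking the Pareto front within each cluster, rather than over the whole population, lets a scaffold that is dominated globally still survive as an elite if it is non-dominated among structurally similar scaffolds, which is one way to preserve diversity.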