Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
Authors: Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS-GPT's high effectiveness, efficiency, and strong generalization ability. |
| Researcher Affiliation | Academia | ¹Shanghai Jiao Tong University, ²Shanghai AI Laboratory, ³University of Oxford, ⁴The University of Sydney. Correspondence to: Siheng Chen <EMAIL>, Jing Shao <EMAIL>. |
| Pseudocode | Yes | Figure 2: Our unified code representation of an executable MAS (i.e., a forward function). Each color denotes an agent. Agents are defined by variables, LLM calls are denoted by function calls, and interactions are represented by string concatenations. Listing 1: Case 1: Multi-agent system generated by MAS-GPT. MAS-GPT can generate query-specific MAS. MAS-GPT designs five independent responding agents, each responsible for different aspects of the task. |
| Open Source Code | Yes | The codes are released at https://github.com/rui-ye/MAS-GPT. |
| Open Datasets | Yes | Our training queries are sampled from the training splits available in MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MBPP (Austin et al., 2021), MMLU (Hendrycks et al., 2021a), and SciQ (Welbl et al., 2017), covering domains of math, coding, and general QA. Testing. To verify that our MAS-GPT can handle diverse queries in practice, we consider multiple benchmarks from diverse domains. These include MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), GSMHard (Gao et al., 2023), and AIME-2024 for math domains; HumanEval (Chen et al., 2021) and HumanEval+ (Liu et al., 2023) for coding tasks; MMLU (Hendrycks et al., 2021a) for general QA tasks; GPQA (Rein et al., 2023) and SciBench (Wang et al., 2024a) for science topics. |
| Dataset Splits | Yes | Our training queries are sampled from the training splits available in MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MBPP (Austin et al., 2021), MMLU (Hendrycks et al., 2021a), and SciQ (Welbl et al., 2017), covering domains of math, coding, and general QA. |
| Hardware Specification | Yes | We train the LLM using 16 A100s with an effective batch size of 32 for 3 epochs at a learning rate of 1e-5 (Zheng et al., 2024). |
| Software Dependencies | No | The paper mentions LLMs like "Qwen2.5-Coder-32B-Instruct (Yang et al., 2024)" and "Llama-3-70B-Instruct (Dubey et al., 2024)" but does not provide specific version numbers for general ancillary software components like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers. |
| Experiment Setup | Yes | We train the LLM using 16 A100s with an effective batch size of 32 for 3 epochs at a learning rate of 1e-5 (Zheng et al., 2024). |
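The Pseudocode row above quotes the paper's "unified code representation": a MAS is expressed as a single executable forward function, where agents are defined by variables, each LLM call is an ordinary function call, and agent interactions are plain string concatenations. A minimal sketch of that idea follows; the `call_llm` stub, agent prompts, and function names are hypothetical illustrations, not code taken from the MAS-GPT repository.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; echoes a tagged response."""
    return f"[response to: {prompt[:40]}]"

def forward(query: str) -> str:
    """A tiny two-agent MAS written in the unified forward-function style."""
    # Agents are defined by variables (here, their system prompts).
    solver_prompt = "Solve the problem step by step."
    critic_prompt = "Review the solution and point out any mistakes."

    # Each LLM call is an ordinary function call.
    solution = call_llm(solver_prompt + "\n" + query)

    # Interactions between agents are string concatenations:
    # the critic sees the query plus the solver's output.
    review = call_llm(critic_prompt + "\n" + query + "\n" + solution)

    # A final aggregation call produces the answer.
    return call_llm("Give the final answer.\n" + query + "\n" + solution + "\n" + review)

print(forward("What is 2 + 2?"))
```

Because the whole system is one self-contained function over strings, a generated MAS of this form can be executed directly against any query, which is what makes the representation convenient for training an LLM to emit query-specific systems.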