Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
Authors: Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS-GPT's high effectiveness, efficiency, and strong generalization ability. |
| Researcher Affiliation | Academia | ¹Shanghai Jiao Tong University, ²Shanghai AI Laboratory, ³University of Oxford, ⁴The University of Sydney. Correspondence to: Siheng Chen <EMAIL>, Jing Shao <EMAIL>. |
| Pseudocode | Yes | Figure 2: Our unified code representation of an executable MAS (i.e., a forward function). Each color denotes an agent. Agents are defined by variables, LLM calls are denoted by function calls, and interactions are represented by string concatenations. Listing 1: Case 1: Multi-agent system generated by MAS-GPT. MAS-GPT can generate query-specific MAS. MAS-GPT designs five independent responding agents, each responsible for different aspects of the task. |
| Open Source Code | Yes | The codes are released at https://github.com/rui-ye/MAS-GPT. |
| Open Datasets | Yes | Our training queries are sampled from the training splits available in MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MBPP (Austin et al., 2021), MMLU (Hendrycks et al., 2021a), and SciQ (Welbl et al., 2017), covering domains of math, coding, and general QA. Testing. To verify that our MAS-GPT can handle diverse queries in practice, we consider multiple benchmarks from diverse domains. These include MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), GSMHard (Gao et al., 2023), and AIME-2024 for math domains; HumanEval (Chen et al., 2021) and HumanEval+ (Liu et al., 2023) for coding tasks; MMLU (Hendrycks et al., 2021a) for general QA tasks; GPQA (Rein et al., 2023) and SciBench (Wang et al., 2024a) for science topics. |
| Dataset Splits | Yes | Our training queries are sampled from the training splits available in MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), MBPP (Austin et al., 2021), MMLU (Hendrycks et al., 2021a), and SciQ (Welbl et al., 2017), covering domains of math, coding, and general QA. |
| Hardware Specification | Yes | We train the LLM using 16 A100s with an effective batch size of 32 for 3 epochs at a learning rate of 1e-5 (Zheng et al., 2024). |
| Software Dependencies | No | The paper mentions LLMs like "Qwen2.5-Coder-32B-Instruct (Yang et al., 2024)" and "Llama-3-70B-Instruct (Dubey et al., 2024)" but does not provide specific version numbers for general ancillary software components like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers. |
| Experiment Setup | Yes | We train the LLM using 16 A100s with an effective batch size of 32 for 3 epochs at a learning rate of 1e-5 (Zheng et al., 2024). |
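The Pseudocode row above quotes the paper's "unified code representation": a MAS is expressed as a single executable forward function, where agents are defined by variables, each LLM call is an ordinary function call, and agent interactions are plain string concatenations. A minimal sketch of that idea follows; the `call_llm` stub, agent prompts, and function names are hypothetical illustrations, not code taken from the MAS-GPT repository.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; echoes a tagged response."""
    return f"[response to: {prompt[:40]}]"

def forward(query: str) -> str:
    """A tiny two-agent MAS written in the unified forward-function style."""
    # Agents are defined by variables (here, their system prompts).
    solver_prompt = "Solve the problem step by step."
    critic_prompt = "Review the solution and point out any mistakes."

    # Each LLM call is an ordinary function call.
    solution = call_llm(solver_prompt + "\n" + query)

    # Interactions between agents are string concatenations:
    # the critic sees the query plus the solver's output.
    review = call_llm(critic_prompt + "\n" + query + "\n" + solution)

    # A final aggregation call produces the answer.
    return call_llm("Give the final answer.\n" + query + "\n" + solution + "\n" + review)

print(forward("What is 2 + 2?"))
```

Because the whole system is one self-contained function over strings, a generated MAS of this form can be executed directly against any query, which is what makes the representation convenient for training an LLM to emit query-specific systems.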