Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

Authors: Liang Yue, Yihong Tang, Kehai Chen, Jie Liu, Min Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To rigorously assess the effectiveness of the MASTER method, we conducted comprehensive experiments comparing the performance of base models fine-tuned on the original datasets, datasets augmented by other methods, and those created through MASTER. The results show that BOOST-QA significantly enhances the diverse capabilities of large language models (LLMs), outperforming several existing approaches focused on data augmentation and selection.
Researcher Affiliation	Academia	(1) Harbin Institute of Technology, Shenzhen, China; (2) Shenzhen Loop Area Institute (SLAI), Shenzhen, China; (3) Harbin Institute of Technology, China
Pseudocode	No	The paper includes mathematical equations (Equation 1, 2, 3, 4) and structured examples of prompts but no explicitly labeled pseudocode or algorithm blocks describing a method or procedure.
Open Source Code	Yes	Our code is publicly available at https://github.com/Toyhom/MASTER.
Open Datasets	Yes	Training datasets. We used three instruction-tuning datasets: (1) Orca-Math-Word-200K, a high-quality set of elementary math QA pairs generated via multi-agent collaboration [39]; (2) ProcQA, mixed-modality programming QA pairs from Stack Overflow [40]; and (3) OpenHermes 2.5, a general-purpose dataset covering commonsense QA and reasoning.
Dataset Splits	Yes	Training datasets. We used three instruction-tuning datasets: (1) Orca-Math-Word-200K, a high-quality set of elementary math QA pairs generated via multi-agent collaboration [39]; (2) ProcQA, mixed-modality programming QA pairs from Stack Overflow [40]; and (3) OpenHermes 2.5, a general-purpose dataset covering commonsense QA and reasoning. We sampled 10,000 instances each from Orca-Math-Word-200K and ProcQA, and 9,000 from OpenHermes 2.5, forming the original dataset (ori-data). Applying the MASTER augmentation method to ori-data produced an equal-sized enhanced dataset (19,000 samples), termed BOOST-QA. Correctness verification with a locally deployed Qwen2.5-32B-Instruct model showed only 4.1% of augmented samples contained procedural reasoning errors. Evaluation datasets. We evaluated our method on HumanEval [41], MBPP [42], MATH [43], MMLU-PRO-MATH [44], MMLU [45], ARC [46] and SCI-Q [47]. These datasets encompass various domains and task types, including human-written coding challenges, mathematical problem-solving, multi-choice questions, and scientific reasoning, thereby providing a comprehensive assessment of our method's capabilities. During evaluation, we assessed the zero-shot capabilities of the MASTER model series across these datasets.
Hardware Specification	Yes	We conducted our experiments on a local Slurm-based computing cluster, utilizing nodes equipped with 48-core CPUs, eight NVIDIA L20 GPUs each with 48 GB of memory, and 925,600 MB of system RAM.
Software Dependencies	No	For model fine-tuning, we employed the LLaMA-Factory framework, applying the Low-Rank Adaptation (LoRA) technique to efficiently fine-tune the LLaMA3-8B-base, Mistral-7B-base, and Qwen2.5-7B-base models.
Experiment Setup	Yes	Each model was fine-tuned for two epochs with a learning rate of 1e-4, requiring approximately 12 hours of training on two L20 GPUs. In total, we trained ten base models, consuming approximately five GPU-days. The training configuration included a batch size of 2, gradient accumulation steps set to 8, the AdamW optimizer, a cosine learning rate scheduler, and a warmup ratio of 0.1.
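
The reported training configuration can be turned into a rough step count and learning-rate curve. The sketch below is illustrative only and is not the authors' code. It assumes that the batch size of 2 is per GPU, that both L20 GPUs run data-parallel, that a run uses 19,000 training samples (the stated size of BOOST-QA), and that warmup is linear and the cosine decay goes to zero. The function names training_schedule and lr_at are hypothetical.

```python
import math


def training_schedule(num_samples, per_device_batch=2, grad_accum=8,
                      num_gpus=2, epochs=2, warmup_ratio=0.1):
    """Derive step counts from the reported hyperparameters.

    Assumption: the reported batch size is per device and the GPUs
    run data-parallel, so the effective batch is the product below.
    """
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def lr_at(step, total_steps, warmup_steps, peak_lr=1e-4):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # 19,000 samples is an assumed training-set size for one run.
    batch, total, warmup = training_schedule(19_000)
    print(f"effective batch={batch}, total steps={total}, warmup steps={warmup}")
    for step in (0, warmup, total // 2, total):
        print(step, lr_at(step, total, warmup))
```

If the batch size of 2 is instead a global figure, set num_gpus=1 in the call; the effective batch halves and the step counts roughly double.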