Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

Authors: Wei Zhu, Zhiwen Tang, Kun Yue

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results across multiple benchmark tasks show that SYMPHONY achieves strong performance even when instantiated with open-source LLMs deployable on consumer-grade hardware. When enhanced with cloud-based LLMs accessible via API, SYMPHONY demonstrates further improvements, outperforming existing state-of-the-art baselines and underscoring the effectiveness of heterogeneous multi-agent coordination in planning tasks.
Researcher Affiliation	Academia	Wei Zhu, Zhiwen Tang, Kun Yue; School of Information Science and Engineering, Yunnan University, Kunming, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Kunming, China
Pseudocode	Yes	The theoretical analysis and complete pseudocode of SYMPHONY can be found in Appendix A. The pseudocode for SYMPHONY can be found in Algorithm 1.
Open Source Code	Yes	Code is available at https://github.com/ZHUWEI-hub/SYMPHONY
Open Datasets	Yes	We evaluate our approach across three representative tasks spanning reasoning, decision-making, and code generation. Specifically, we conduct experiments on: (1) multi-hop question answering using HotpotQA [40] to assess reasoning capabilities; (2) goal-directed interaction on WebShop [41] to evaluate decision-making and planning; and (3) code generation on MBPP [3] to test the model's ability to reason and produce executable solutions.
Dataset Splits	Yes	All experiments are carried out under a unified protocol aligned with previous work [29, 46, 12]. To ensure comparability, we apply consistent prompt formats and fixed hyperparameter settings across both configurations, including decoding temperature, planning depth, rollout budget, and number of demonstrations. HotpotQA: We use K = 10 candidate actions per step and adopt 3 few-shot examples. WebShop: We also set K = 10, but use a single few-shot example tailored to the task format. MBPP: We follow the setup in LATS and employ K = 8 with a zero-shot prompting strategy.
Hardware Specification	Yes	Under the SYMPHONY-S setting, since it involves three models, the system can be comfortably run on three 24GB RTX 4090 GPUs, with sufficient memory headroom.
Software Dependencies	No	The paper lists specific LLM models used (e.g., Qwen2.5-7B-Instruct-1M, Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, GPT-4), but does not provide version numbers for ancillary software components like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or operating systems.
Experiment Setup	Yes	Unless otherwise specified, the following parameters are shared across all experiments: the number of rollouts per node is set to n = 4; the exploration constant in UCT is set to c = 2, following the configuration in LATS [46]; the UCB scheduling parameter is α = 20; the temperature for action-sampling agents is set to 0.2 to better follow the input instructions, while the evaluation agents use a temperature of 0 to ensure deterministic value estimation. HotpotQA: We use K = 10 candidate actions per step and adopt 3 few-shot examples. WebShop: We also set K = 10, but use a single few-shot example tailored to the task format. MBPP: We follow the setup in LATS and employ K = 8 with a zero-shot prompting strategy.
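
The hyperparameters quoted in the Experiment Setup row can be gathered into one configuration object, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the names SearchConfig, TASK_CONFIGS, and uct_score are hypothetical, and uct_score is the textbook UCT formula, which may differ from the exact variant used in SYMPHONY. The α = 20 scheduling parameter is only recorded, because the excerpt does not say how it enters the search.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical container for the settings quoted in the table above."""
    rollouts_per_node: int = 4         # n = 4
    uct_exploration: float = 2.0       # c = 2, following LATS
    ucb_schedule_alpha: float = 20.0   # alpha = 20; usage not described in the excerpt
    action_temperature: float = 0.2    # action-sampling agents
    eval_temperature: float = 0.0      # evaluation agents
    num_candidates: int = 10           # K candidate actions per step
    few_shot_examples: int = 3         # demonstrations in the prompt


# Per-task overrides as quoted in the table; the dictionary keys are illustrative.
TASK_CONFIGS = {
    "hotpotqa": SearchConfig(num_candidates=10, few_shot_examples=3),
    "webshop": SearchConfig(num_candidates=10, few_shot_examples=1),
    "mbpp": SearchConfig(num_candidates=8, few_shot_examples=0),  # zero-shot
}


def uct_score(total_value: float, visits: int, parent_visits: int, c: float) -> float:
    """Textbook UCT: mean value plus an exploration bonus scaled by c."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean_value = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus


if __name__ == "__main__":
    cfg = TASK_CONFIGS["mbpp"]
    print(cfg.num_candidates, cfg.few_shot_examples)
    print(uct_score(total_value=2.0, visits=4, parent_visits=16, c=cfg.uct_exploration))
```

Keeping the shared values as dataclass defaults and overriding only K and the number of demonstrations per task mirrors how the excerpt separates shared settings from task-specific ones.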