Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

Authors: Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON. [...] In this section, we present the experiment setup in Sec. 4.1, demonstrate the method effectiveness in Sec. 4.2, validate the heterogeneous results in Sec. 4.3, test scale-up capability in Sec. 4.4, and conduct ablation studies in Sec. 4.5.
Researcher Affiliation | Academia | (1) Academy for Engineering and Technology, Fudan University; (2) TMLR Group, Department of Computer Science, Hong Kong Baptist University; (3) Sydney AI Center, The University of Sydney. Correspondence to: Bo Han <EMAIL>.
Pseudocode | Yes | Algorithm 1: Belief Network Training Algorithm [...] Algorithm 2: Scaling-Up Framework for ECON
Open Source Code | Yes | The code is publicly available at: https://github.com/tmlr-group/ECON.
Open Datasets | Yes | We evaluate 6 released open-sourced LLMs: LLaMA3.1 8B (Dubey et al., 2024), LLaMA3.1 70B, Mistral-7B (Jiang et al., 2023), LLaMA3.1 405B, Mixtral-8x22B (Jiang et al., 2024) and Qwen1.5 110B (Yang et al., 2024) across 5 reasoning tasks, including 4 mathematical datasets (GSM8K (Cobbe et al., 2021), GSM-Hard (Gao et al., 2023), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021)) and one commonsense reasoning dataset (StrategyQA (Geva et al., 2021)). Then, we evaluate GPT-4 Turbo (Achiam et al., 2023) in a very challenging planning task (TravelPlanner (Xie et al., 2024a)) to further validate the performance. The details of benchmarks can be found in Appendix C.5.
Dataset Splits | Yes | We evaluate 6 released open-sourced LLMs: LLaMA3.1 8B (Dubey et al., 2024), LLaMA3.1 70B, Mistral-7B (Jiang et al., 2023), LLaMA3.1 405B, Mixtral-8x22B (Jiang et al., 2024) and Qwen1.5 110B (Yang et al., 2024) across 5 reasoning tasks, including 4 mathematical datasets (GSM8K (Cobbe et al., 2021), GSM-Hard (Gao et al., 2023), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021)) and one commonsense reasoning dataset (StrategyQA (Geva et al., 2021)). Then, we evaluate GPT-4 Turbo (Achiam et al., 2023) in a very challenging planning task (TravelPlanner (Xie et al., 2024a)) to further validate the performance. The details of benchmarks can be found in Appendix C.5. [...] C.5. Task Setups [...] GSM8K is a benchmark for mathematical reasoning that requires multi-step problem solving. [...] The dataset contains approximately 7.5K problems in the training set and 1.3K problems in the test set. [...] TravelPlanner is a benchmark crafted for evaluating language agents in tool-use and complex planning within multiple constraints. The dataset comprises 1,225 queries in total, divided into training (45 queries), validation (180 queries), and test (1,000 queries) sets.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments or train the ECON framework components. It mentions using the Together API to query different LLMs (LLaMA3.1 8B/70B/405B, Mistral-7B, GPT-4 Turbo), which implies using external API services rather than local, specified hardware.
Software Dependencies | No | The paper mentions using 'Adam' as an optimizer in the hyperparameter tables (C.6.1-C.6.6) and the 'Together API' for LLM inference (D. Together API Integration for ECON), but does not specify version numbers for these or any other software components (e.g., the Python version, deep learning framework versions such as PyTorch or TensorFlow, or CUDA versions).
Experiment Setup | Yes | The hyperparameters for training can be found in Appendix C.6. [...] C.6. Hyperparameter [...] Table 8: Hyperparameters (8B, MATH) [lists Training Configuration, Network Architecture, Temperature & Sampling, Reward Configuration, Loss Weights, and Early Stopping parameters with specific values] [...] Table 9: Hyperparameters (8B, GSM8K) [...] Table 10: Hyperparameters (70B, MATH) [...] Table 11: Hyperparameters (70B, GSM8K) [...] Table 12: Hyperparameters (405B, MATH) [...] Table 13: Hyperparameters (405B, GSM8K) [each lists similarly detailed parameters]
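To illustrate what the hyperparameter categories quoted above (Training Configuration, Temperature & Sampling, Early Stopping) typically specify, here is a minimal sketch. All field names and numeric values below are hypothetical placeholders, not values from the paper's Appendix C.6 tables; only the category labels and the use of Adam come from the quoted text.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Training Configuration (hypothetical values, not from the paper)
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 32
    # Temperature & Sampling (hypothetical)
    temperature: float = 0.7
    # Early Stopping (hypothetical)
    patience: int = 5
    min_delta: float = 1e-3

def should_stop(losses, patience, min_delta):
    """Stop when the loss has not improved by at least min_delta
    over the last `patience` epochs (one common early-stopping rule)."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta

cfg = TrainingConfig()
history = [1.0, 0.8, 0.7, 0.69, 0.69, 0.69, 0.69, 0.69, 0.69]
print(should_stop(history, cfg.patience, cfg.min_delta))  # prints True
```

A table like the paper's Table 8 would pin each of these fields to a concrete value per model size and dataset; the sketch only shows the shape of such a configuration and one plausible early-stopping criterion.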