Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning and Planning Multi-Agent Tasks via an MoE-based World Model
Authors: Zijie Zhao, Zhongyue Zhao, Kaixuan Xu, Yuqian Fu, Jiajun Chai, Yuanheng Zhu, Dongbin Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | M3W demonstrates superior performance, sample efficiency, and multi-task adaptability, as validated on Bi-Dex Hands with 14 tasks and MA-Mujoco with 24 tasks. |
| Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) Institute of Automation, Chinese Academy of Sciences; (3) Meituan |
| Pseudocode | Yes | B. Pseudocode. We present the pseudocode for M3W training and planning, as shown in Algorithm 1 and Algorithm 2, respectively. |
| Open Source Code | Yes | The code are available at https://github.com/zhaozijie2022/m3w-marl. |
| Open Datasets | Yes | We evaluate M3W and the baselines on two challenging benchmarks: Bimanual Dexterous Hands (Bi-Dex Hands) [4] with 14 tasks, and the multi-agent Mujoco (MA-Mujoco) [26] with 24 tasks. |
| Dataset Splits | No | We collected 25K steps of transitions for each task using random actions, resulting in a combined dataset of 150K transitions, which was used consistently across all models for training. |
| Hardware Specification | Yes | We report the per-step execution time on a single RTX A6000 GPU (Figure 15, left), where H denotes the rollout horizon and Kp is the number of planner iterations. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The hyperparameters for the model-based planner have been provided in Appendix A. Here, we present the hyperparameters regarding the training and network structures in Table 3 and Table 4. Table 3 (the training hyperparameters): buffer size 1e6; batch size 256; train interval 1 step balance λ 0.5; lr 5e-4; encoder lr 1.5e-4; n-step return 10; gamma 0.99. Table 4 (the network configurations): task dim 96; latent dim 512; encoder size [256]; Sim Norm dim 8; num experts 16; experts size [512, 512]; predictor K 2; actor & critic size [512, 512]; num bins 101; scale ρ 0.01; dynamics coef 20; reward coef 0.1; q coef 0.1; entropy coef 0.01. |
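
For readers who want to reuse the reported setup, the values quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical and are not taken from the M3W repository, and the two entries whose boundary is unclear in the extracted text ("train interval" and "balance λ") are deliberately left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Values quoted from Table 3; field names are hypothetical."""
    buffer_size: int = 1_000_000      # "buffer size 1e6"
    batch_size: int = 256
    lr: float = 5e-4
    encoder_lr: float = 1.5e-4
    n_step_return: int = 10
    gamma: float = 0.99


@dataclass(frozen=True)
class NetworkConfig:
    """Values quoted from Table 4; field names are hypothetical."""
    task_dim: int = 96
    latent_dim: int = 512
    encoder_size: Tuple[int, ...] = (256,)
    simnorm_dim: int = 8              # "Sim Norm dim 8"
    num_experts: int = 16
    experts_size: Tuple[int, ...] = (512, 512)
    predictor_k: int = 2              # "predictor K 2"
    actor_critic_size: Tuple[int, ...] = (512, 512)
    num_bins: int = 101
    scale_rho: float = 0.01           # "scale ρ 0.01"
    dynamics_coef: float = 20.0
    reward_coef: float = 0.1
    q_coef: float = 0.1
    entropy_coef: float = 0.01


train_cfg = TrainingConfig()
net_cfg = NetworkConfig()
```

Frozen dataclasses are used here so that the quoted values cannot be modified by accident; any mapping from these names onto the authors' actual configuration files would need to be checked against the released code.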