Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces

Authors: Xiaotian Hao, Jianye Hao, Chenjun Xiao, Kai Li, Dong Li, Yan Zheng

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate that our method reliably outperforms the prior algorithms, especially when using much smaller simulation budgets. The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero. Single State Cooperative Matrix Game: First, we use two simple matrix games to verify the benefit of using sampling without replacement (Algo. 1) instead of sampling with replacement (Monte Carlo sampling) when doing policy evaluation and improvement. Setup: Table 2 shows the payoffs of a 2-player 3-action matrix game and the payoffs of a 3-player 2-action matrix game. Results: Fig. 2 shows the learning curves of the three methods. The left two figures show the average test rewards and the right two figures show the percentages of optimal actions. Multiagent Switch: As shown in Fig. 3, Switch4 in MA-Gym (Koul 2019) is a hard coordination task in which 4 agents must reach their corresponding homes by passing through a narrow corridor that is only one agent wide. The learning curves under different simulation budgets are shown in Fig. 4 and Fig. 5. Results are averaged over 5 random seeds.
Researcher Affiliation | Collaboration | Xiaotian Hao¹, Jianye Hao¹, Chenjun Xiao², Kai Li², Dong Li², Yan Zheng¹; ¹College of Intelligence and Computing, Tianjin University; ²Noah's Ark Lab, Huawei
Pseudocode | Yes | Algorithm 1: Stochastically sample top-k joint actions without replacement from the joint policy π.
Open Source Code | Yes | The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero.
Open Datasets | Yes | Multiagent Switch: As shown in Fig. 3, Switch4 in MA-Gym (Koul 2019) is a hard coordination task in which 4 agents must reach their corresponding homes by passing through a narrow corridor that is only one agent wide.
Dataset Splits | No | The paper mentions using a 'test set' for evaluation and 'learning curves', but it does not explicitly specify train/validation/test dataset splits or how data was partitioned for validation.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU models, CPU types, or memory specifications, for the machines used to run the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We set k = 4 and keep all parameters the same for these three methods. We set the sample number k = clamp(n_sim/2, min=2, max=16). Results are averaged over 5 random seeds.
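
To make the quoted setup more concrete, the sketch below illustrates the clamp rule for the sample number k and one standard way to draw k distinct joint actions without replacement, the Gumbel-top-k trick. It is not the paper's Algorithm 1: the function names are hypothetical, the joint policy is assumed to factor into independent per-agent policies, integer division in the clamp is an assumption, and the joint action space is enumerated by brute force, which is only practical for tiny games such as the 3-player 2-action matrix game.

```python
import itertools
import math
import random


def sample_count(n_sim, lo=2, hi=16):
    """k = clamp(n_sim / 2, min=lo, max=hi); integer division is an assumption."""
    return max(lo, min(hi, n_sim // 2))


def gumbel_top_k_joint_actions(agent_policies, k, rng):
    """Draw up to k distinct joint actions with the Gumbel-top-k trick.

    agent_policies: one list of action probabilities per agent
    (assumed here to define a factored joint policy).
    Every joint action is enumerated, so this only suits tiny action spaces.
    """
    scored = []
    action_ranges = [range(len(p)) for p in agent_policies]
    for joint in itertools.product(*action_ranges):
        probs = [p[a] for p, a in zip(agent_policies, joint)]
        if min(probs) <= 0.0:
            continue  # zero-probability joint actions are never sampled
        log_prob = sum(math.log(q) for q in probs)
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # the lower bound keeps log() away from zero.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        scored.append((log_prob + gumbel, joint))
    # The k largest perturbed log-probabilities form a sample without replacement.
    scored.sort(reverse=True)
    return [joint for _, joint in scored[:k]]


# Hypothetical per-agent policies for a 3-player 2-action game.
policies = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
k = sample_count(8)  # 8 // 2 = 4, already inside [2, 16]
joint_actions = gumbel_top_k_joint_actions(policies, k, random.Random(0))
```

Because each joint action receives exactly one perturbed score, the same joint action can never be returned twice, which is the property that distinguishes this scheme from Monte Carlo sampling with replacement.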