Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Multi-Agent Imitation by Learning and Sampling from Factorized Soft Q-Function

Authors: Yi-Chen Li, Zhongxiang Ling, Tao Jiang, Fuxiang Zhang, Pengyuan Wang, Lei Yuan, Zongzhang Zhang, Yang Yu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on common benchmarks including the discrete control tasks StarCraft Multi-Agent Challenge v2 (SMACv2), Gold Miner, and Multi Particle Environments (MPE), as well as the continuous control task Multi-Agent MuJoCo (MaMuJoCo), demonstrate that MAFIS achieves superior performance compared with baselines. Our code is available at https://github.com/LAMDA-RL/MAFIS.
Researcher Affiliation | Collaboration | 1 National Key Laboratory for Novel Software Technology, Nanjing University, China; 2 School of Artificial Intelligence, Nanjing University, Nanjing, China; 3 Nanyang Technological University, Singapore; 4 Polixir Technologies, Nanjing, China
Pseudocode | Yes | Algorithm 1 summarizes the pseudocode of MAFIS.
Open Source Code | Yes | Our code is available at https://github.com/LAMDA-RL/MAFIS.
Open Datasets | Yes | Experiments on common benchmarks including the discrete control tasks StarCraft Multi-Agent Challenge v2 (SMACv2) [13], Gold Miner [15], and Multi Particle Environments (MPE) [27], as well as the continuous control task Multi-Agent MuJoCo (MaMuJoCo) [11], demonstrate that MAFIS achieves superior performance compared with baselines.
Dataset Splits | No | The paper mentions collecting expert trajectories (100 for discrete tasks, 20 for continuous tasks) and rolling out 10 trajectories for evaluation. However, it does not specify explicit training/validation/test splits for these collected demonstrations.
Hardware Specification | Yes | We use the following hardware: NVIDIA RTX 4090 × 8; 12th Gen Intel(R) Core(TM) i9-12900K.
Software Dependencies | Yes | We use the following software versions: Python 3.7; Gym 0.21.0 [6]; MuJoCo-py 2.1.2.14; PyTorch 1.12.1 [28].
Experiment Setup | Yes | The hyper-parameter settings used for benchmark results are presented in Table 2 and Table 3. Table 2 (hyper-parameter settings for discrete control tasks): batch size = 32; α = 0.5 for zerg_{10_vs_10, 5_vs_5} and protoss_5_vs_5, 0.2 for others; (online) update frequency = 5 for MPE and SMACv2, 2 for Gold Miner. Table 3 (hyper-parameter settings for continuous control tasks): batch size = 1000; Langevin steps K = 25; Langevin noise variance σ² = 0.25; sample number N = 20; entropy weight α = 0.5; (online) update frequency = 5. Additionally, for discrete control tasks, we introduce dropout with a rate of 0.5 in the mixing network to mitigate the risk of over-fitting. For continuous control tasks, we incorporate a target Q-network, which is updated using the Polyak average update mechanism [30] with an update ratio of 0.005. To ensure stable training, we use the target Q-network to sample actions to estimate Equation (8). Furthermore, we apply a gradient penalty to the Q-network with a coefficient of 0.25 and a gradient margin of 1. We also found that constraining the output of the Q-network can further improve performance. Therefore, we apply L2 regularization to its output with a coefficient of 0.01.
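
The reported experiment setup can be collected into a small configuration sketch, which may help when checking a rerun against the quoted values. The class names, field names, benchmark keys, and the `polyak_update` helper below are hypothetical and are not taken from the MAFIS repository; only the numeric values come from the quoted Tables 2 and 3 and the accompanying text. The sketch also assumes the common convention that the update ratio 0.005 weights the online parameters, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List

# Tasks that use alpha = 0.5 in the quoted Table 2; all other discrete tasks use 0.2.
HIGH_ALPHA_TASKS = ("zerg_10_vs_10", "zerg_5_vs_5", "protoss_5_vs_5")

# Hypothetical benchmark keys; the values are the quoted (online) update frequencies.
DISCRETE_UPDATE_FREQUENCY = {"mpe": 5, "smacv2": 5, "gold_miner": 2}


@dataclass(frozen=True)
class DiscreteControlConfig:
    """Values quoted for discrete control tasks (Table 2 plus dropout)."""
    batch_size: int = 32
    mixing_dropout: float = 0.5  # dropout rate in the mixing network

    def alpha(self, task: str) -> float:
        return 0.5 if task in HIGH_ALPHA_TASKS else 0.2

    def update_frequency(self, benchmark: str) -> int:
        # Raises KeyError for benchmarks that the quoted table does not list.
        return DISCRETE_UPDATE_FREQUENCY[benchmark]


@dataclass(frozen=True)
class ContinuousControlConfig:
    """Values quoted for continuous control tasks (Table 3 plus extras)."""
    batch_size: int = 1000
    langevin_steps: int = 25              # K
    langevin_noise_variance: float = 0.25  # sigma^2
    sample_number: int = 20               # N
    entropy_weight: float = 0.5           # alpha
    update_frequency: int = 5
    polyak_update_ratio: float = 0.005
    gradient_penalty_coef: float = 0.25
    gradient_margin: float = 1.0
    q_output_l2_coef: float = 0.01


def polyak_update(
    target: Dict[str, List[float]],
    online: Dict[str, List[float]],
    tau: float,
) -> Dict[str, List[float]]:
    """Return (1 - tau) * target + tau * online for every named parameter.

    Assumption: tau weights the online parameters, so a small tau means the
    target network changes slowly.
    """
    if target.keys() != online.keys():
        raise ValueError("target and online must have the same parameter names")
    return {
        name: [(1.0 - tau) * t + tau * o for t, o in zip(target[name], online[name])]
        for name in target
    }
```

With these defaults, one call such as `polyak_update({"w": [0.0]}, {"w": [1.0]}, ContinuousControlConfig().polyak_update_ratio)` moves the target weight 0.5% of the way toward the online weight.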