Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$f$-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

Authors: Kefan Su, Zongqing Lu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that TVPO outperforms state-of-the-art fully decentralized learning methods on three popular cooperative MARL benchmarks, thereby verifying the efficacy of TVPO. The experiments contain four main parts.
Researcher Affiliation | Academia | Kefan Su (EMAIL), School of Computer Science, Peking University; Zongqing Lu (EMAIL), School of Computer Science, Peking University
Pseudocode | Yes | The practical algorithm of TVPO is summarized in Algorithm 1 in Appendix C.
Open Source Code | Yes | To ensure reproducibility, our codes are included in the supplementary material and will be open source upon acceptance.
Open Datasets | Yes | SMAC (Samvelyan et al., 2019), multi-agent MuJoCo (Peng et al., 2021), and SMACv2 (Ellis et al., 2023)
Dataset Splits | No | No explicit dataset splits are mentioned in the paper. The paper evaluates performance on tasks within several environments (SMAC, multi-agent MuJoCo, SMACv2, MPE), which inherently define the test scenarios but do not involve traditional training/validation/test splits of a pre-collected static dataset.
Hardware Specification | Yes | We perform the whole experiment with a total of four NVIDIA A100 GPUs.
Software Dependencies | Yes | The version of StarCraft II in SMAC is 4.10 for our experiments in all the SMAC tasks.
Experiment Setup | Yes | We use 3-layer MLPs for the actor and the critic and use ReLU as non-linearities. The number of hidden units of the MLP is 128. We train all the networks with an Adam optimizer. The learning rates of the actor and critic are both 5e-4. The number of epochs for every batch of samples is 15, which is the recommended value in Yu et al. (2021). For IPPO, the clip parameter is 0.2, the same as in Schulman et al. (2017). For DPO, the hyperparameters are set as the original paper (Su & Lu, 2022b) recommends. Our code for IQL is based on the open-source PyMARL codebase (Apache-2.0 license), which we modify for individual parameters. The default architecture in PyMARL is an RNN, so we follow it, and the number of hidden units is 128. The learning rate of IQL is also 5e-4. The architectures of the actor and critic of IDDPG are 3-layer MLPs, and their learning rates are both 5e-4. Our code for I2Q is from the open-source code of the original paper (Jiang & Lu, 2022), and we keep the hyperparameters of I2Q at the default values of that code in our experiments. The version of StarCraft II in SMAC is 4.10 for our experiments in all the SMAC tasks. We set the episode length of all multi-agent MuJoCo tasks to 1000 in all of our multi-agent MuJoCo experiments. We perform the whole experiment with a total of four NVIDIA A100 GPUs. We have summarized the hyperparameters in Table 4.
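For concreteness, the network described in the Experiment Setup row (a 3-layer MLP with two hidden layers of 128 units and ReLU non-linearities) can be sketched as below. This is a minimal NumPy illustration of the stated architecture, not the authors' code; the input/output sizes, initialization scheme, and function names are assumptions for the sketch, and the Adam optimizer and 5e-4 learning rate from the paper are not implemented here.

```python
import numpy as np

def relu(x):
    # ReLU non-linearity, as stated in the experiment setup
    return np.maximum(x, 0.0)

def init_mlp(in_dim, hidden=128, out_dim=1, seed=0):
    """Initialize a 3-layer MLP: two hidden layers of 128 units, one output layer.
    He-style initialization is an assumption, not specified in the paper."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, out_dim]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

# Hypothetical sizes: 10-dim observation, 4 actions, batch of 5
params = init_mlp(in_dim=10, out_dim=4)
out = forward(params, np.zeros((5, 10)))
print(out.shape)  # (5, 4)
```

In the paper's setup, one such network would serve as the actor and another as the critic, each trained with Adam at a learning rate of 5e-4.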