Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

$f$-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

Authors: Kefan Su, Zongqing Lu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that TVPO outperforms state-of-the-art fully decentralized learning methods on three popular cooperative MARL benchmarks, thereby verifying the efficacy of TVPO. The experiments contain four main parts.
Researcher Affiliation | Academia | Kefan Su (EMAIL), School of Computer Science, Peking University; Zongqing Lu (EMAIL), School of Computer Science, Peking University
Pseudocode | Yes | The practical algorithm of TVPO is summarized in Algorithm 1 in Appendix C.
Open Source Code | Yes | To ensure reproducibility, our codes are included in the supplementary material and will be open source upon acceptance.
Open Datasets | Yes | SMAC (Samvelyan et al., 2019), multi-agent MuJoCo (Peng et al., 2021), and SMACv2 (Ellis et al., 2023)
Dataset Splits | No | No explicit dataset splits are mentioned in the paper. The paper evaluates performance on tasks within several environments (SMAC, multi-agent MuJoCo, SMACv2, MPE), which inherently define the test scenarios but do not involve traditional training/validation/test splits of a pre-collected static dataset.
Hardware Specification | Yes | We perform the whole experiment with a total of four NVIDIA A100 GPUs.
Software Dependencies | Yes | The version of StarCraft II in SMAC is 4.10 for our experiments in all the SMAC tasks.
Experiment Setup | Yes | We use 3-layer MLPs for the actor and the critic and use ReLU as non-linearities. The number of hidden units of the MLP is 128. We train all the networks with an Adam optimizer. The learning rates of the actor and critic are both 5e-4. The number of epochs for every batch of samples is 15, which is the recommended value in Yu et al. (2021). For IPPO, the clip parameter is 0.2, the same as in Schulman et al. (2017). For DPO, the hyperparameters are set as the original paper (Su & Lu, 2022b) recommends. Our code for IQL is based on the open-source PyMARL codebase (Apache-2.0 license), which we modify for individual parameters. The default architecture in PyMARL is an RNN, so we follow it, and the number of hidden units is 128. The learning rate of IQL is also 5e-4. The architectures of the actor and critic of IDDPG are 3-layer MLPs, and their learning rates are both 5e-4. Our code for I2Q is from the open-source code of the original paper (Jiang & Lu, 2022), and we keep the hyperparameters of I2Q at the default values of that code in our experiments. The version of StarCraft II in SMAC is 4.10 for our experiments in all the SMAC tasks. We set the episode length of all multi-agent MuJoCo tasks to 1000 in all of our multi-agent MuJoCo experiments. We perform the whole experiment with a total of four NVIDIA A100 GPUs. We have summarized the hyperparameters in Table 4.
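For concreteness, the network described in the Experiment Setup row (a 3-layer MLP with two hidden layers of 128 units and ReLU non-linearities) can be sketched as below. This is a minimal NumPy illustration of the stated architecture, not the authors' code; the input/output sizes, initialization scheme, and function names are assumptions for the sketch, and the Adam optimizer and 5e-4 learning rate from the paper are not implemented here.

```python
import numpy as np

def relu(x):
    # ReLU non-linearity, as stated in the experiment setup
    return np.maximum(x, 0.0)

def init_mlp(in_dim, hidden=128, out_dim=1, seed=0):
    """Initialize a 3-layer MLP: two hidden layers of 128 units, one output layer.
    He-style initialization is an assumption, not specified in the paper."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, out_dim]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

# Hypothetical sizes: 10-dim observation, 4 actions, batch of 5
params = init_mlp(in_dim=10, out_dim=4)
out = forward(params, np.zeros((5, 10)))
print(out.shape)  # (5, 4)
```

In the paper's setup, one such network would serve as the actor and another as the critic, each trained with Adam at a learning rate of 5e-4.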