Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library

Authors: Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Xiaodan Liang, Zhihui Li, Xiaojun Chang, Yaodong Yang

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted experiments to demonstrate the efficiency of MARLlib compared to EPyMARL and the on-policy baseline (official MAPPO (Yu et al., 2022)). The experiments were performed on a local server with an NVIDIA RTX A6000 GPU and an AMD Ryzen Threadripper PRO 5945WX 12-Cores CPU. The testing scenario is MMM2 from SMAC (Samvelyan et al., 2019), and the testing algorithm is MAPPO. The total consumed timesteps are 10^6. From Table 2, it is evident that MARLlib is significantly more efficient than the other frameworks in terms of clock time... In this section, we conducted a comprehensive evaluation of 17 algorithms on 23 tasks from five widely-used MARL testing environments, namely SMAC (Samvelyan et al., 2019), MPE (Lowe et al., 2017), GRF (Kurach et al., 2020), MAMuJoCo (Peng et al., 2021), and MAgent (Zheng et al., 2018). We selected these environments for their popularity in MARL research and their diversity in task modes, observation shapes, additional information, action spaces, sparse or dense rewards, and homogeneous or heterogeneous agent types. The evaluation involved running each algorithm on each task with four different random seeds, resulting in over one thousand experiments in total. We measured the mean return achieved by each algorithm across these experiments. The results of our experiments are presented in Table 4 and Figure 6.
Researcher Affiliation Collaboration Siyi Hu1 EMAIL Yifan Zhong2 EMAIL Minquan Gao2 EMAIL Weixun Wang3 EMAIL Hao Dong2 EMAIL Xiaodan Liang4,6 EMAIL Zhihui Li5 EMAIL Xiaojun Chang1,4 EMAIL Yaodong Yang2 EMAIL 1 ReLER, AAII, University of Technology Sydney 2 Institute for Artificial Intelligence, Peking University 3 NetEase Fuxi AI Lab 4 MBZUAI 5 Shandong Artificial Intelligence Institute, Qilu University of Technology 6 School of Intelligent Systems Engineering, Sun Yat-sen University
Pseudocode No The paper describes the design and implementation of MARLlib and its components but does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes The MARLlib library's source code is publicly accessible on GitHub: https://github.com/Replicable-MARL/MARLlib.
Open Datasets Yes We conducted a comprehensive evaluation of 17 algorithms on 23 tasks from five widely-used MARL testing environments, namely SMAC (Samvelyan et al., 2019), MPE (Lowe et al., 2017), GRF (Kurach et al., 2020), MAMuJoCo (Peng et al., 2021), and MAgent (Zheng et al., 2018).
Dataset Splits No The paper uses simulation environments where data is generated through interaction, and specifies training duration in terms of timesteps (e.g., 'total consumed timesteps are 10^6'). It does not provide specific train/test/validation splits for a pre-collected, fixed dataset.
Hardware Specification Yes We conducted experiments to demonstrate the efficiency of MARLlib compared to EPyMARL and the on-policy baseline (official MAPPO (Yu et al., 2022)). The experiments were performed on a local server with an NVIDIA RTX A6000 GPU and an AMD Ryzen Threadripper PRO 5945WX 12-Cores CPU.
Software Dependencies Yes We have tested the installation on Python 3.8 with both Ubuntu 18.04 and Ubuntu 20.04... # recommend always keeping the gym version at 0.21.0. $ pip install gym==0.21.0
Experiment Setup Yes mappo.fit(env, model, stop={'timesteps_total': 1000000}, checkpoint_freq=100, share_policy='group')... The total consumed timesteps are 10^6. From Table 2, it is evident that MARLlib is significantly more efficient than the other frameworks in terms of clock time... The results obtained by EPyMARL involved 40 million steps for on-policy algorithms and four million steps for off-policy algorithms. In contrast, MARLlib consumed only half of these steps for training as we found it sufficient for convergence.
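The quoted `mappo.fit(...)` call can be sketched as a minimal run configuration. Only the argument values (`timesteps_total`, `checkpoint_freq`, `share_policy`) come from the quoted snippet; the surrounding `marl.make_env` / `marl.algos` / `marl.build_model` calls are shown as comments and reflect the general MARLlib quickstart pattern, since executing them requires a full MARLlib and SMAC installation.

```python
# Configuration matching the quoted experiment setup: MAPPO on SMAC's MMM2
# map, trained for 10^6 total timesteps with grouped parameter sharing.
stop_conditions = {"timesteps_total": 1_000_000}  # "total consumed timesteps are 10^6"
run_settings = {
    "checkpoint_freq": 100,    # save a checkpoint every 100 training iterations
    "share_policy": "group",   # agents in the same group share one policy
}

# With MARLlib and SMAC installed, the corresponding run would look roughly like
# (a sketch of the library's quickstart pattern, not a verbatim recipe):
#
#   from marllib import marl
#   env = marl.make_env(environment_name="smac", map_name="MMM2")
#   mappo = marl.algos.mappo(hyperparam_source="smac")
#   model = marl.build_model(env, mappo, {"core_arch": "gru"})
#   mappo.fit(env, model, stop=stop_conditions, **run_settings)

print(stop_conditions["timesteps_total"])
```

Note that the quoted comparison against EPyMARL uses this same budget: MARLlib reached convergence with half the environment steps EPyMARL reported (20M vs. 40M for on-policy, 2M vs. 4M for off-policy algorithms).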