Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Authors: Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang Bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and LeakyReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark.
Researcher Affiliation	Academia	Simin Li1, Zihao Mao1, Hanxiao Li1, Zonglei Jing1, Zhuohang Bian1, Jun Guo1, Li Wang1, Zhuoran Han1, Ruixiao Xu1, Xin Yu1, Chengdong Ma3, Yuqing Ma5, Bo An6, Yaodong Yang3, Weifeng Lv1, Xianglong Liu1,2,4. 1 State Key Laboratory of Complex & Critical Software Environment, Beihang University, China; 2 Zhongguancun Laboratory, China; 3 Institute of Artificial Intelligence, Peking University, China; 4 Institute of Data Space, Hefei Comprehensive National Science Center, China; 5 Institute of Artificial Intelligence, Beihang University, China; 6 Nanyang Technological University, Singapore
Pseudocode	No	The paper describes experimental procedures and findings but does not contain any explicit pseudocode blocks or algorithms in a structured format within the main text or appendix.
Open Source Code	Yes	Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark.
Open Datasets	Yes	Our study incorporates four real-world environments encompassing diverse task types, control modes, episode lengths, simulation engines, data sources, and control challenges, as summarized in Table 2. The first two environments, Dexterous Hand Manipulation (Dexhand) [53] and Quadrotor Swarm Control (Quad) [54] are grounded in real-world applications, allowing policies learned in simulation to be directly transferred to physical robots with the same dynamic. ... The remaining two environments, Intelligent Traffic Control (Traffic) [55] and Active Voltage Control (Voltage) [56] are constructed from real-world data, ensuring high-fidelity replication of real-world dynamics.
Dataset Splits	No	The paper discusses training and evaluating models across various tasks and uncertainty settings (cooperative baseline, 13 robustness evaluations, and 13 resilience evaluations), but it does not specify traditional training/test/validation dataset splits for a fixed dataset in the context of the experiments conducted. The evaluation involves different scenarios and conditions for the MARL algorithms, not pre-split static datasets.
Hardware Specification	Yes	Our full experiment takes 230K GPU hours, measured in GTX 4090.
Software Dependencies	No	The paper mentions common MARL algorithms (MADDPG, MAPPO, HAPPO) and refers to codebases (Epymarl [42], MAPPO [3], Pymarlv2 [43]), but it does not explicitly list specific versions of programming languages, libraries, or frameworks used for its implementation.
Experiment Setup	Yes	In this section, we describe the hyperparameters, types of uncertainties, environments, algorithms used for our evaluation. Our full experiment takes 230K GPU hours, measured in GTX 4090. Table 1: General and algorithm-specific hyperparameters shared for all methods. Default choices are shown in bold font, which is shared by all algorithms and environments. Hyperparameters. For generality, we pick hyperparameters that are used by most methods. Specifically, we consider both general and algorithm-specific hyperparameters. General hyperparameters includes network hidden size, discount factor, activation function, initialization method, neural network type, learning rate, critic learning rate, feature normalization, share parameters and early stop. For MADDPG, algorithm-specific choices includes N-step TD and exploration noise. For MAPPO and HAPPO, algorithm-specific choices includes entropy coefficient, Use GAE and Use PopArt. The descriptions of each hyperparameters and the reasons for selecting them are deferred to Appendix A.1.1. The values of each hyperparameters are shown in Table 1. We use bold font to denote the default choices, which are shared for all algorithms and environments. This leads to 15 different hyperparameters. In each time, we vary one different hyperparameters to test its effect, resulting in 34 models with different implementations. Other hyperparameters used in our experiments are listed in Appendix A.1.2.
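
The Experiment Setup response above describes varying one hyperparameter at a time around a shared default configuration. A minimal Python sketch of that kind of sweep is given here for illustration only. The function name `one_at_a_time_configs`, the key names, and all candidate values are hypothetical placeholders, not the values from the paper's Table 1 and not code from the authors' repository.

```python
from typing import Any

# Hypothetical search space. Names loosely follow hyperparameters quoted above;
# the values are illustrative placeholders, NOT those in the paper's Table 1.
# By convention in this sketch, the first value in each list is the default.
SEARCH_SPACE: dict[str, list[Any]] = {
    "hidden_size": [128, 64, 256],
    "activation": ["relu", "leaky_relu"],
    "share_parameters": [True, False],
    "early_stop": [False, True],
}


def one_at_a_time_configs(space: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return the default config plus one config per non-default value.

    Every non-default config differs from the defaults in exactly one key.
    """
    defaults = {name: values[0] for name, values in space.items()}
    configs = [dict(defaults)]  # the all-defaults baseline comes first
    for name, values in space.items():
        for value in values[1:]:  # skip the default value itself
            cfg = dict(defaults)
            cfg[name] = value
            configs.append(cfg)
    return configs


configs = one_at_a_time_configs(SEARCH_SPACE)
print(len(configs))  # 1 baseline + 2 + 1 + 1 + 1 variants = 6
```

Under this scheme the number of trained models grows with the sum of the non-default values per hyperparameter rather than with their product, which keeps a sweep over many hyperparameters tractable at the cost of not measuring interactions between them.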