Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization
Authors: Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that even with state-of-the-art exploration techniques, PG fails to discover the risky cooperation strategies. In contrast, RPG discovers a surprisingly diverse set of human-interpretable strategies in all these games, including some non-trivial emergent behavior. |
| Researcher Affiliation | Academia | 1 Tsinghua University, 2 Shanghai Qi Zhi Institute, 3 UC Berkeley, 4 UCSD, 5 CMU, 6 Peking University, 7 University of Washington |
| Pseudocode | Yes | Algorithm 1: RPG: Reward-Randomized Policy Gradient |
| Open Source Code | Yes | The source code and example videos can be found in our website: https://sites.google.com/view/staghuntrpg. |
| Open Datasets | Yes | A new multi-agent environment Agar.io, which allows complex multi-agent strategic behavior. We released the environment to the community as a novel testbed for MARL research. [and] We consider two games adapted from Peysakhovich & Lerer (2018b), Monster-Hunt and Escalation. |
| Dataset Splits | No | The paper mentions 'evaluation results are averaged over 100 episodes in gridworlds and 1000 episodes in Agar.io' and 'We repeat all the experiments with 3 seeds', implying testing. However, it does not explicitly define distinct 'training', 'validation', and 'test' dataset splits with specific percentages or counts. |
| Hardware Specification | No | The paper describes the training process using PPO, Adam optimizer, and GRU modules, but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like PPO, Adam optimizer, and GRU but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | More optimization hyper-parameter settings are in Tab.6. In addition, Monster-Hunt also utilizes GRU modules to infer opponent's identity during adaption training and the parallel threads are set to 64. [and] More optimization hyper-parameter settings are in Tab.7. |