Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization
Authors: Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that even with state-of-the-art exploration techniques, PG fails to discover the risky cooperation strategies. In contrast, RPG discovers a surprisingly diverse set of human-interpretable strategies in all these games, including some non-trivial emergent behavior. |
| Researcher Affiliation | Academia | 1 Tsinghua University, 2 Shanghai Qi Zhi Institute, 3 UC Berkeley, 4 UCSD, 5 CMU, 6 Peking University, 7 University of Washington |
| Pseudocode | Yes | Algorithm 1: RPG: Reward-Randomized Policy Gradient (a toy sketch of this loop appears after the table) |
| Open Source Code | Yes | The source code and example videos can be found in our website: https://sites.google.com/view/staghuntrpg. |
| Open Datasets | Yes | A new multi-agent environment Agar.io, which allows complex multi-agent strategic behavior. We released the environment to the community as a novel testbed for MARL research. [and] We consider two games adapted from Peysakhovich & Lerer (2018b), Monster-Hunt and Escalation. |
| Dataset Splits | No | The paper notes that 'evaluation results are averaged over 100 episodes in gridworlds and 1000 episodes in Agar.io' and that 'We repeat all the experiments with 3 seeds', which implies an evaluation protocol. However, it does not explicitly define distinct training, validation, and test dataset splits with specific percentages or counts. |
| Hardware Specification | No | The paper describes the training process using PPO, Adam optimizer, and GRU modules, but does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like PPO, Adam optimizer, and GRU but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | More optimization hyper-parameter settings are in Tab. 6. In addition, Monster-Hunt also utilizes GRU modules to infer the opponent's identity during adaptation training, and the parallel threads are set to 64 (an illustrative GRU sketch also appears after the table). [and] More optimization hyper-parameter settings are in Tab. 7. |
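To make the Pseudocode row concrete, here is a minimal, self-contained sketch of the reward-randomization idea behind Algorithm 1 (RPG) on a 2x2 stag-hunt matrix game. The payoff values, the REINFORCE learner, and all hyper-parameters below are illustrative assumptions, not the authors' released implementation (which trains with PPO; see their website for the actual code).

```python
# Illustrative sketch of Reward-Randomized Policy Gradient (RPG) on a
# 2x2 stag-hunt matrix game. All numbers here are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Original stag-hunt payoffs R[a1, a2] for player 1 (the game is symmetric):
# action 0 = Stag (risky cooperation), action 1 = Hare (safe defection).
R_ORIG = np.array([[4.0, -10.0],
                   [1.0,   1.0]])

def selfplay_train(R, iters=3000, lr=0.1):
    """Train one softmax policy over {Stag, Hare} by REINFORCE in self-play."""
    theta = np.zeros(2)                        # action logits
    for _ in range(iters):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a1, a2 = rng.choice(2, p=p), rng.choice(2, p=p)
        r = R[a1, a2]                          # player 1's reward
        grad = -p                              # grad of log pi(a1 | theta) ...
        grad[a1] += 1.0                        # ... is e_{a1} - p
        theta += lr * r * grad                 # vanilla policy-gradient step
    return theta

def expected_return(theta, R):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return float(p @ R @ p)                    # E[r] when both players use p

# RPG outer loop: train a population of policies under randomized payoffs,
# then evaluate every candidate under the ORIGINAL rewards and keep the best
# one (the paper additionally fine-tunes the selected policy on the true game).
candidates = []
for _ in range(10):
    R_rand = R_ORIG + rng.uniform(-5.0, 5.0, size=(2, 2))  # randomized rewards
    candidates.append(selfplay_train(R_rand))
best = max(candidates, key=lambda th: expected_return(th, R_ORIG))
print("expected return on the original game:", expected_return(best, R_ORIG))
```

The point of the outer loop is the one the abstract quote makes: plain policy gradient on the original payoffs tends to settle on the safe Hare equilibrium, while some randomized reward draws make the risky Stag cooperation easy to learn, and those policies can then be selected (and fine-tuned) under the true rewards.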
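The Experiment Setup row also mentions GRU modules used to infer the opponent's identity during adaptation training, with 64 parallel threads. Below is a hypothetical PyTorch sketch of such a recurrent policy head; the class name, layer sizes, and action count are placeholders, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    """Recurrent policy whose hidden state summarizes the trajectory so the
    agent can implicitly infer which (fixed) opponent policy it is facing."""
    def __init__(self, obs_dim=32, hidden_dim=64, num_actions=5):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.pi = nn.Linear(hidden_dim, num_actions)  # action logits

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0, h: (1, batch, hidden_dim)
        out, h = self.gru(obs_seq, h0)
        return self.pi(out), h

# One forward pass over a rollout batch, e.g. 64 parallel environment threads.
policy = GRUPolicy()
logits, h = policy(torch.randn(64, 10, 32))
```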