Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization

Authors: Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, Yi Wu

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that even with state-of-the-art exploration techniques, PG fails to discover the risky cooperation strategies. In contrast, RPG discovers a surprisingly diverse set of human-interpretable strategies in all these games, including some non-trivial emergent behavior. |
| Researcher Affiliation | Academia | 1 Tsinghua University, 2 Shanghai Qi Zhi Institute, 3 UC Berkeley, 4 UCSD, 5 CMU, 6 Peking University, 7 University of Washington |
| Pseudocode | Yes | Algorithm 1: RPG: Reward-Randomized Policy Gradient |
| Open Source Code | Yes | The source code and example videos can be found on our website: https://sites.google.com/view/staghuntrpg |
| Open Datasets | Yes | A new multi-agent environment Agar.io, which allows complex multi-agent strategic behavior. We released the environment to the community as a novel testbed for MARL research. [and] We consider two games adapted from Peysakhovich & Lerer (2018b), Monster-Hunt and Escalation. |
| Dataset Splits | No | The paper states that 'evaluation results are averaged over 100 episodes in gridworlds and 1000 episodes in Agar.io' and that 'We repeat all the experiments with 3 seeds', which implies an evaluation protocol. However, it does not explicitly define distinct training, validation, and test splits with specific percentages or counts. |
| Hardware Specification | No | The paper describes the training process using PPO, the Adam optimizer, and GRU modules, but does not provide any details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components such as PPO, the Adam optimizer, and GRU, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | More optimization hyper-parameter settings are in Tab. 6. In addition, Monster-Hunt also utilizes GRU modules to infer the opponent's identity during adaptation training, and the parallel threads are set to 64. [and] More optimization hyper-parameter settings are in Tab. 7. |
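The Pseudocode row above refers to the paper's Algorithm 1 (RPG: Reward-Randomized Policy Gradient). As a rough illustration of that high-level loop only, here is a minimal Python sketch; the helper names (`sample_reward`, `train_pg`, `evaluate`) and the defaults are hypothetical placeholders, not the interface of the released code at the website linked above.

```python
import numpy as np


def reward_randomized_pg(sample_reward, train_pg, evaluate, true_reward, n_samples=10):
    """Minimal sketch of the RPG loop described by Algorithm 1 in the paper.

    All helpers are hypothetical stand-ins, not the authors' released code:
      sample_reward()              -> draws a perturbed reward function from the reward space
      train_pg(reward, init=None)  -> runs a policy-gradient learner (e.g. PPO) under `reward`
      evaluate(policy, reward)     -> average episode return of `policy` under `reward`
    """
    # Phase 1: reward randomization -- train one PG policy per sampled reward function.
    candidates = [train_pg(sample_reward()) for _ in range(n_samples)]

    # Phase 2: evaluate every candidate under the original (true) reward and keep the best.
    scores = [evaluate(policy, true_reward) for policy in candidates]
    best = candidates[int(np.argmax(scores))]

    # Phase 3: fine-tune the selected policy under the true reward before returning it.
    return train_pg(true_reward, init=best)
```

Per the Experiment Setup row, the policy-gradient steps in the paper use PPO with the Adam optimizer and the hyper-parameters listed in its Tab. 6 and Tab. 7; how those map onto `train_pg` here is left unspecified.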