UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning

Authors: Tarun Gupta, Anuj Mahajan, Bei Peng, Wendelin Boehmer, Shimon Whiteson

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
Researcher Affiliation | Academia | Department of Computer Science, University of Oxford, Oxford, United Kingdom; Department of Software Technology, Delft University of Technology, Delft, Netherlands.
Pseudocode | No | The main text of the paper refers to "Appendix B" for a detailed algorithm, but Appendix B itself is not included in the provided text.
Open Source Code | No | The paper mentions videos of learnt policies available at a URL (https://sites.google.com/view/uneven-marl/) but does not provide a statement or link for the open-source code of their methodology.
Open Datasets | Yes | "We now evaluate UneVEn on challenging cooperative StarCraft II (SC2) maps from the popular SMAC benchmark (Samvelyan et al., 2019)."
Dataset Splits | No | The paper describes training duration in steps and testing with rollouts (e.g., "training for 35k steps", "test 60 rollouts") within simulation environments, but does not provide traditional train/validation/test dataset splits.
Hardware Specification | No | The paper acknowledges "a generous equipment grant from NVIDIA" but does not specify any particular GPU models, CPU models, or other hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions various algorithms and benchmarks (e.g., VDN, QMIX, StarCraft II), but it does not specify software names with their version numbers required for replication.
Experiment Setup | Yes | "α is annealed from 0.3 to 1.0 in our experiments over a fixed number of steps at the beginning of training. Once this exploration stage is finished (i.e., α = 1), actions are always taken based on the target task's joint action-value function."
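
The quoted setup only states that α is annealed from 0.3 to 1.0 over a fixed number of steps; the shape of the schedule and its length are not given. Below is a minimal sketch, assuming a linear ramp and a hypothetical `anneal_steps` horizon, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): linear annealing of the
# exploration weight alpha from 0.3 to 1.0 over a fixed number of
# environment steps, matching the quoted experiment setup.
# `anneal_steps` is a hypothetical placeholder value.

def alpha_schedule(step: int,
                   start: float = 0.3,
                   end: float = 1.0,
                   anneal_steps: int = 50_000) -> float:
    """Return the annealed alpha for the given training step."""
    if step >= anneal_steps:
        # Exploration stage finished: alpha stays at 1, so actions are
        # taken based on the target task's joint action-value function.
        return end
    frac = step / anneal_steps
    return start + frac * (end - start)


if __name__ == "__main__":
    for step in (0, 25_000, 50_000, 75_000):
        print(step, round(alpha_schedule(step), 3))
```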