Disentangling Sources of Risk for Distributional Multi-Agent Reinforcement Learning

Authors: Kyunghwan Son, Junsu Kim, Sungsoo Ahn, Roben D Delos Reyes, Yung Yi, Jinwoo Shin

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that DRIMA significantly outperforms prior state-of-the-art methods across various scenarios in the StarCraft Multi-Agent Challenge environment. Notably, DRIMA shows robust performance where prior methods learn only a highly suboptimal policy, regardless of reward shaping, exploration scheduling, and noisy (random or adversarial) agents.
Researcher Affiliation | Academia | Kyunghwan Son (1), Junsu Kim (1), Sungsoo Ahn (2), Roben Delos Reyes (1), Yung Yi (1), Jinwoo Shin (1); (1) Korea Advanced Institute of Science and Technology (KAIST), (2) Pohang University of Science and Technology (POSTECH). Correspondence to: Kyunghwan Son <kevinson9473@kaist.ac.kr>.
Pseudocode | Yes | Algorithm 1: DRIMA algorithm
Open Source Code | No | The paper links to repositories for external tools and baselines (e.g., SMAC, PyMARL, WQMIX, QPLEX, DFAC), but it does not provide a link to, or an explicit statement about releasing, the source code for DRIMA itself.
Open Datasets | Yes | Environments. We mainly evaluate our method on the StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019).
Dataset Splits | No | The paper reports a replay buffer size, a mini-batch size, and the number of test episodes, but it does not provide explicit train/validation/test dataset splits (e.g., percentages or exact counts for each split).
Hardware Specification | Yes | Using an NVIDIA Titan Xp graphics card, the training time varies from 8 hours to 24 hours for different scenarios.
Software Dependencies | Yes | The hyperparameters of the training and testing configurations for VDN, QMIX, and QTRAN are the same as in the recent GitHub code of SMAC (Samvelyan et al., 2019) and PyMARL, with StarCraft version SC2.4.6.2.69232.
Experiment Setup | Yes | We used the Adam optimizer. For the other methods except DRIMA and DFAC, all neural networks are trained with the RMSProp optimizer and a 0.0005 learning rate, following their papers. We use ε-greedy action selection with ε decreasing from 1 to 0.05 for exploration, following Samvelyan et al. (2019). For the discount factor, we set γ = 0.99. The replay buffer stores at most 5000 episodes, and the mini-batch size is 32. [...] We set λ_opt = 3 and λ_nopt = λ_ub = 1.
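
The Software Dependencies and Experiment Setup rows above describe a standard SMAC/PyMARL-style training configuration. The sketch below is a minimal, hedged reconstruction of that setup in Python: it collects the quoted hyperparameters into a plain dictionary, defines a linear ε-greedy schedule, and shows how the SMAC environment (the `smac` package of Samvelyan et al., 2019) is typically instantiated and stepped. The dictionary keys, the `linear_epsilon` helper, and the `smoke_test` function are illustrative names, not the authors' actual code or config schema; the ε anneal horizon is not stated in the excerpt and is left as a parameter.

```python
import numpy as np
from smac.env import StarCraft2Env  # SMAC environment used for evaluation

# Hyperparameters quoted in the Experiment Setup row above.
# Keys are illustrative; the paper does not publish its config schema.
TRAIN_CONFIG = {
    "optimizer": "adam",           # DRIMA; other baselines (except DFAC) use RMSProp, lr = 0.0005
    "gamma": 0.99,                 # discount factor
    "buffer_size_episodes": 5000,  # replay buffer capacity (episodes)
    "batch_size_episodes": 32,     # mini-batch size
    "epsilon_start": 1.0,          # epsilon-greedy exploration, annealed to 0.05
    "epsilon_finish": 0.05,
    "lambda_opt": 3.0,             # loss weights: lambda_opt = 3, lambda_nopt = lambda_ub = 1
    "lambda_nopt": 1.0,
    "lambda_ub": 1.0,
}

def linear_epsilon(step, anneal_steps, start=1.0, finish=0.05):
    """Linearly anneal epsilon from `start` to `finish`.

    The anneal horizon (`anneal_steps`) is not given in the quoted excerpt,
    so it is kept as an explicit argument rather than a fixed value.
    """
    frac = min(step / float(anneal_steps), 1.0)
    return start + frac * (finish - start)

def smoke_test(map_name="3s5z"):
    """Roll out one episode on a SMAC map with random availability-masked actions."""
    env = StarCraft2Env(map_name=map_name)
    info = env.get_env_info()  # n_agents, n_actions, obs/state shapes, episode_limit
    env.reset()
    terminated, episode_return = False, 0.0
    while not terminated:
        actions = []
        for agent_id in range(info["n_agents"]):
            avail = env.get_avail_agent_actions(agent_id)
            actions.append(np.random.choice(np.nonzero(avail)[0]))
        reward, terminated, _ = env.step(actions)
        episode_return += reward
    env.close()
    return episode_return
```

As a rough check of the reported hardware budget, a single SMAC scenario trained under this kind of configuration took 8 to 24 hours on one NVIDIA Titan Xp according to the paper; the smoke test above only verifies that the environment and action masking are wired correctly before committing to a full run.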