Stabilizing Q Learning Via Soft Mellowmax Operator

Authors: Yaozhong Gan, Zhe Zhang, Xiaoyang Tan

AAAI 2021, pp. 7501-7509 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on the StarCraft Multi-Agent Challenge (SMAC) (Vinyals et al. 2017) benchmark show that our proposed method significantly improves the performance and sample efficiency compared to several state-of-the-art MARL algorithms.
Researcher Affiliation | Academia | Yaozhong Gan, Zhe Zhang, Xiaoyang Tan; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; {yzgancn, zhangzhe, x.tan}@nuaa.edu.cn
Pseudocode | No | The paper describes the proposed methods and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link to access the source code for the methodology described in the paper.
Open Datasets | Yes | For single-agent RL, we used several games from the PyGame Learning Environment (PLE) (Tasfi 2016) and MinAtar (Young and Tian 2019); while for MARL, we adopted the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019).
Dataset Splits | No | The paper mentions evaluating performance on '10 test episodes every 5000 training steps' and using a replay buffer, but it does not explicitly specify training, validation, and test dataset splits with percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper describes the neural network architecture but does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the use of the RMSprop optimizer and various game environments such as PLE, MinAtar, and SMAC, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | For the PLE game environments, the neural network was a multi-layer perceptron with hidden layers fixed to [64, 64]. The discount factor was 0.99. The size of the replay buffer was 10,000. The weights of the neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration policy, with ϵ decreasing linearly from 1.0 to 0.01 over 1,000 steps; after 1,000 steps, ϵ was fixed at 0.01. The rest of the experimental settings are detailed in the appendix. We choose the candidates α and ω from the set {1, 2, 5, 10, 15} for SM2, and choose ω from {1, 5, 10, 15, 20} for Softmax (Song, Parr, and Carin 2019). As for Mellowmax (Asadi and Littman 2017), we choose ω from {5, 10, 30, 50, 100, 200}.
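
The quoted setup gives enough detail to sketch the hyperparameters and exploration schedule in code. The following is a minimal sketch only: the paper releases no code, so the configuration names, the epsilon helper, and the mellowmax function below are illustrative assumptions, and the SM2 operator itself is omitted because its exact formula is not quoted in this summary.

```python
import numpy as np

# Hedged reconstruction of the quoted single-agent (PLE) setup. None of these
# names come from the authors' implementation; they only restate the reported
# hyperparameters in executable form.
CONFIG = {
    "hidden_layers": [64, 64],      # MLP hidden sizes for the PLE games
    "discount": 0.99,               # discount factor
    "replay_buffer_size": 10_000,
    "optimizer": "RMSprop",
    "grad_clip": 5,
    "batch_size": 32,
    "target_update_frames": 200,    # target network sync interval
    "eps_start": 1.0,
    "eps_end": 0.01,
    "eps_decay_steps": 1_000,
}

# Candidate temperature grids quoted in the setup.
SM2_ALPHA_OMEGA = [1, 2, 5, 10, 15]           # alpha and omega candidates for SM2
SOFTMAX_OMEGA = [1, 5, 10, 15, 20]            # Softmax (Song, Parr, and Carin 2019)
MELLOWMAX_OMEGA = [5, 10, 30, 50, 100, 200]   # Mellowmax (Asadi and Littman 2017)


def epsilon(step: int) -> float:
    """Linear decay of epsilon from 1.0 to 0.01 over the first 1,000 steps,
    then held fixed at 0.01."""
    frac = min(step / CONFIG["eps_decay_steps"], 1.0)
    return CONFIG["eps_start"] + frac * (CONFIG["eps_end"] - CONFIG["eps_start"])


def mellowmax(q_values: np.ndarray, omega: float) -> float:
    """Standard mellowmax operator (Asadi and Littman 2017):
    mm_omega(q) = log(mean(exp(omega * q))) / omega, computed in log-sum-exp
    form for numerical stability."""
    z = omega * q_values
    m = np.max(z)
    return (m + np.log(np.mean(np.exp(z - m)))) / omega
```

For example, `mellowmax(np.array([1.0, 2.0, 3.0]), omega=10.0)` returns roughly 2.89, a value between the mean and the maximum of the Q-values that approaches the maximum as ω grows; sweeping ω (and α for SM2) over the grids above controls how sharply the operator concentrates on the largest Q-value.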