Stabilizing Q Learning Via Soft Mellowmax Operator
Authors: Yaozhong Gan, Zhe Zhang, Xiaoyang Tan
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on the StarCraft Multi-agent Challenge (SMAC) (Vinyals et al. 2017) benchmark show that our proposed method significantly improves the performance and sample efficiency compared to several state-of-the-art MARL algorithms. |
| Researcher Affiliation | Academia | Yaozhong Gan, Zhe Zhang, Xiaoyang Tan College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence {yzgancn, zhangzhe, x.tan}@nuaa.edu.cn |
| Pseudocode | No | The paper describes the proposed methods and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link to access the source code for the methodology described in the paper. |
| Open Datasets | Yes | For single agent RL, we used several games from PyGame Learning Environment (PLE) (Tasfi 2016) and MinAtar (Young and Tian 2019); while for MARL, we adopted StarCraft Multi-agent Challenge (SMAC) (Samvelyan et al. 2019). |
| Dataset Splits | No | The paper mentions evaluating performance on '10 test episodes every 5000 training steps' and using a replay buffer, but it does not explicitly specify training, validation, and test dataset splits with percentages, counts, or references to predefined splits. |
| Hardware Specification | No | The paper describes the neural network architecture but does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the use of the RMSprop optimizer and various game environments like PLE, MinAtar, and SMAC, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For PLE game environments, the neural network was a multi-layer perceptron with hidden layers fixed to [64, 64]. The discount factor was 0.99. The size of the replay buffer was 10000. The weights of the neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration policy with ϵ decreasing linearly from 1.0 to 0.01 in 1,000 steps. After 1,000 steps, ϵ was fixed to 0.01. The rest of the experimental settings are detailed in the appendix. We choose the candidates α and ω from the set {1, 2, 5, 10, 15} for SM2, and choose ω among {1, 5, 10, 15, 20} for Softmax (Song, Parr, and Carin 2019). As for Mellowmax (Asadi and Littman 2017), we choose ω among {5, 10, 30, 50, 100, 200}. |
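
To make the reported PLE setup concrete, the following is a minimal sketch that assembles the quoted hyperparameters into one place, assuming a PyTorch implementation (the paper does not name its framework). The observation/action dimensions and the use of norm-based gradient clipping are illustrative assumptions; the numeric values come from the setup row above.

```python
# Sketch of the reported single-agent configuration: a [64, 64] MLP Q-network,
# RMSprop with gradient clip 5, and the reported epsilon-greedy schedule.
import torch
import torch.nn as nn

GAMMA = 0.99          # discount factor
BUFFER_SIZE = 10_000  # replay buffer capacity
BATCH_SIZE = 32
TARGET_UPDATE = 200   # target network sync interval (frames)
GRAD_CLIP = 5.0       # "gradient clip 5" (norm clipping assumed here)

class QNetwork(nn.Module):
    """Multi-layer perceptron with hidden layers fixed to [64, 64]."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def epsilon(step: int) -> float:
    """Linear decay from 1.0 to 0.01 over the first 1,000 steps, then fixed."""
    if step >= 1_000:
        return 0.01
    return 1.0 - (1.0 - 0.01) * step / 1_000

# Hypothetical dimensions, for illustration only.
q_net = QNetwork(obs_dim=8, n_actions=3)
optimizer = torch.optim.RMSprop(q_net.parameters())
# After loss.backward(), clip gradients as reported:
# torch.nn.utils.clip_grad_norm_(q_net.parameters(), GRAD_CLIP)
```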
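
For context on the ω grids quoted above, the sketch below implements the standard Boltzmann softmax (Song, Parr, and Carin 2019) and Mellowmax (Asadi and Littman 2017) value-aggregation operators over a vector of Q-values, together with the quoted candidate grids. The paper's own SM2 (soft mellowmax) operator is not reproduced here; see the original paper for its definition. NumPy and the function names are illustrative choices.

```python
import numpy as np

def boltzmann_softmax(q, omega):
    """Boltzmann softmax operator (Song, Parr, and Carin 2019):
    a weighted average of Q-values with weights proportional to exp(omega * q)."""
    z = q - q.max()                      # shift for numerical stability
    w = np.exp(omega * z)
    return float(np.sum(w * q) / np.sum(w))

def mellowmax(q, omega):
    """Mellowmax operator (Asadi and Littman 2017):
    (1/omega) * log(mean(exp(omega * q))), computed in a stable way."""
    z = q - q.max()
    return float(q.max() + np.log(np.mean(np.exp(omega * z))) / omega)

# Candidate temperature grids reported in the setup row above
# (SM2 searches both alpha and omega over the same set).
GRIDS = {
    "SM2_alpha_omega": [1, 2, 5, 10, 15],
    "Softmax_omega":   [1, 5, 10, 15, 20],
    "Mellowmax_omega": [5, 10, 30, 50, 100, 200],
}
```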