Stabilizing Q Learning Via Soft Mellowmax Operator

Authors: Yaozhong Gan, Zhe Zhang, Xiaoyang Tan

AAAI 2021, pp. 7501-7509 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on the StarCraft Multi-Agent Challenge (SMAC) (Vinyals et al. 2017) benchmark show that our proposed method significantly improves the performance and sample efficiency compared to several state-of-the-art MARL algorithms.
Researcher Affiliation | Academia | Yaozhong Gan, Zhe Zhang, Xiaoyang Tan; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; {yzgancn, zhangzhe, x.tan}@nuaa.edu.cn
Pseudocode | No | The paper describes the proposed methods and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link to access the source code for the methodology described in the paper.
Open Datasets | Yes | For single-agent RL, we used several games from the PyGame Learning Environment (PLE) (Tasfi 2016) and MinAtar (Young and Tian 2019); while for MARL, we adopted the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019).
Dataset Splits | No | The paper mentions evaluating performance on '10 test episodes every 5000 training steps' and using a replay buffer, but it does not explicitly specify training, validation, and test dataset splits with percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper describes the neural network architecture but does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions the use of the RMSprop optimizer and various game environments such as PLE, MinAtar, and SMAC, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | For the PLE game environments, the neural network was a multi-layer perceptron with hidden layers fixed to [64, 64]. The discount factor was 0.99. The size of the replay buffer was 10,000. The weights of the neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration policy, with ϵ decreasing linearly from 1.0 to 0.01 over 1,000 steps; after 1,000 steps, ϵ was fixed at 0.01. The rest of the experimental settings are detailed in the appendix. We choose the candidates α and ω from the set {1, 2, 5, 10, 15} for SM2, and choose ω from {1, 5, 10, 15, 20} for Softmax (Song, Parr, and Carin 2019). As for Mellowmax (Asadi and Littman 2017), we choose ω from {5, 10, 30, 50, 100, 200}.
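
The quoted setup gives enough detail to sketch the hyperparameters and exploration schedule in code. The following is a minimal sketch only: the paper releases no code, so the configuration names, the epsilon helper, and the mellowmax function below are illustrative assumptions, and the SM2 operator itself is omitted because its exact formula is not quoted in this summary.

```python
import numpy as np

# Hedged reconstruction of the quoted single-agent (PLE) setup. None of these
# names come from the authors' implementation; they only restate the reported
# hyperparameters in executable form.
CONFIG = {
    "hidden_layers": [64, 64],      # MLP hidden sizes for the PLE games
    "discount": 0.99,               # discount factor
    "replay_buffer_size": 10_000,
    "optimizer": "RMSprop",
    "grad_clip": 5,
    "batch_size": 32,
    "target_update_frames": 200,    # target network sync interval
    "eps_start": 1.0,
    "eps_end": 0.01,
    "eps_decay_steps": 1_000,
}

# Candidate temperature grids quoted in the setup.
SM2_ALPHA_OMEGA = [1, 2, 5, 10, 15]           # alpha and omega candidates for SM2
SOFTMAX_OMEGA = [1, 5, 10, 15, 20]            # Softmax (Song, Parr, and Carin 2019)
MELLOWMAX_OMEGA = [5, 10, 30, 50, 100, 200]   # Mellowmax (Asadi and Littman 2017)


def epsilon(step: int) -> float:
    """Linear decay of epsilon from 1.0 to 0.01 over the first 1,000 steps,
    then held fixed at 0.01."""
    frac = min(step / CONFIG["eps_decay_steps"], 1.0)
    return CONFIG["eps_start"] + frac * (CONFIG["eps_end"] - CONFIG["eps_start"])


def mellowmax(q_values: np.ndarray, omega: float) -> float:
    """Standard mellowmax operator (Asadi and Littman 2017):
    mm_omega(q) = log(mean(exp(omega * q))) / omega, computed in log-sum-exp
    form for numerical stability."""
    z = omega * q_values
    m = np.max(z)
    return (m + np.log(np.mean(np.exp(z - m)))) / omega
```

For example, `mellowmax(np.array([1.0, 2.0, 3.0]), omega=10.0)` returns roughly 2.89, a value between the mean and the maximum of the Q-values that approaches the maximum as ω grows; sweeping ω (and α for SM2) over the grids above controls how sharply the operator concentrates on the largest Q-value.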