Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

Authors: Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, Qiaobo Chen, Yinyuting Yin, Hao Zhang, Tengfei Shi, Liang Wang, Qiang Fu, Wei Yang, Lanxiao Huang

AAAI 2020, pp. 6672-6679 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the trained AI agent can defeat top professional human players on different hero types, tested on the 1v1 mode of Honor of Kings, a popular MOBA game.
Researcher Affiliation | Industry | (1) Tencent AI Lab, Shenzhen, China; (2) Tencent Timi Studio, Chengdu, China. {dericye, ricardoliu, mingfeisun, beishi, masonzhao, alberthwu, yannickyu, shaojieyang, haroldwu, leoqwguo, ciaochen, mailyyin, howezhang, francisshi, enginewang, leonfu, willyang, jackiehuang}@tencent.com
Pseudocode | No | The paper describes the algorithm and network architecture in text and with diagrams, but does not include a formal pseudocode or algorithm block.
Open Source Code | No | As a next step, we will make our framework and algorithm open source, and the game core of Honor of Kings accessible to the community to facilitate further research on complex games; and we will also provide part of our computing resources via virtual cloud for public use. By Nov. 21, 2019, the Beta version is open to 4 universities in China for user feedback.
Open Datasets | No | We test our method by using the 1v1 mode in Honor of Kings, which is the most popular MOBA game nowadays, and has been actively used as the testbed for recent RL advances (Eisenach et al. 2019; Wang et al. 2018; Jiang, Ekwedike, and Liu 2018).
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts) for training, validation, and test sets. It describes the overall evaluation process against human players but not explicit data partitioning for model validation.
Hardware Specification | Yes | Our framework runs over a total number of 600,000 CPU cores encapsulated in Dockers and 1,064 Nvidia GPUs (a mixture of Tesla P40 and V100). To train one hero, we use 48 P40 GPU cards and 18,000 CPU cores.
Software Dependencies | No | The paper mentions software such as Feather CNN, TensorFlow, Caffe, and the Adam optimizer but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | To train one hero, we use 48 P40 GPU cards and 18,000 CPU cores. The minibatch size per GPU card is 4096. The time step and unit size of the LSTM are 16 and 1024, respectively. We train using full rollouts, i.e., one episode ends until the termination of the game, and we use zero-start, i.e., the agent starts the game from Frame 0. We use Adam optimizer with initial learning rate 0.0001. In the dual-clipped PPO, the two clipping hyperparameters ϵ and c are set as 0.2 and 3, respectively. The discount factor is set as 0.997.
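
To make the quoted hyperparameters concrete, below is a minimal sketch of the dual-clipped PPO surrogate objective described in the paper, with ϵ = 0.2 and c = 3 taken from the setup above. The function name and the NumPy implementation are illustrative assumptions for this page, not the authors' code.

```python
import numpy as np

def dual_clipped_ppo_objective(ratio, advantage, eps=0.2, c=3.0):
    """Per-sample surrogate objective (to be maximized).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sample
    advantage: estimated advantage for each sample
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)

    # Standard PPO clipping of the importance ratio.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    standard = np.minimum(ratio * advantage, clipped * advantage)

    # Dual clip: for negative advantages, bound the objective from below by
    # c * advantage so an unbounded ratio cannot dominate the update.
    dual = np.maximum(standard, c * advantage)
    return np.where(advantage < 0.0, dual, standard)

# Example: with ratio = 10 and advantage = -1, standard clipping gives -10,
# while the dual clip bounds the objective at c * advantage = -3.
print(dual_clipped_ppo_objective(ratio=[10.0], advantage=[-1.0]))  # [-3.]
```

The second clip only activates on negative-advantage samples, where the standard PPO minimum is unbounded below; this matches the role of c described in the paper, though the exact training code (off-policy corrections, batching, LSTM unrolling) is not shown here.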