Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Efficient Reinforcement Learning for Full-length Game of StarCraft II

Authors: Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, Tong Lu

JAIR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We achieve a win rate of 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating level built-in AI (level-7). By a final 3-layer hierarchical architecture and applying significant tricks to train SC2 agents, we increase the win rate against the level-8, level-9, and level-10 to 96%, 97%, and 94%, respectively. Our codes and models are all open-sourced now at https://github.com/liuruoze/HierNet-SC2. We can then compare our work with mAS using the same computing resources and training time. By experiment results, we show that our method is more effective when using limited resources.
Researcher Affiliation Academia Ruo-Ze Liu EMAIL Zhen-Jia Pang EMAIL Zhou-Yu Meng EMAIL Wenhai Wang EMAIL Yang Yu EMAIL Tong Lu EMAIL (Corresponding author) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Pseudocode Yes Algorithm 1 The proposed HRL training algorithm
Open Source Code Yes Our codes and models are all open-sourced now at https://github.com/liuruoze/HierNet-SC2. ... The inference and training codes of mini-AlphaStar are all open-sourced at https://github.com/liuruoze/mini-AlphaStar.
Open Datasets Yes We use a replay pack of the 3.16.1 version provided by Blizzard.
Dataset Splits No We select some (e.g., 100) replays from all replays and divide them into one training set and one test set.
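The quoted passage mentions dividing selected replays into one training set and one test set without giving a ratio. A minimal sketch of such a division, assuming a hypothetical 80/20 split and placeholder replay filenames (neither is specified in the paper):

```python
import random


def split_replays(replays, test_fraction=0.2, seed=0):
    """Shuffle replays and divide them into a training and a test set.

    test_fraction and seed are illustrative assumptions; the paper does
    not document the actual split ratio or selection procedure.
    """
    rng = random.Random(seed)
    shuffled = list(replays)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# E.g., the "some (e.g., 100) replays" mentioned above:
replays = [f"replay_{i}.SC2Replay" for i in range(100)]
train_set, test_set = split_replays(replays)
```

Fixing the seed makes the split reproducible across runs, which is exactly the detail this reproducibility variable flags as missing.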
Hardware Specification Yes We train the agent on a single machine with 4 GPUs and 48 CPU threads. ... This machine is the same one as we train our hierarchical approach: 48 cores Intel(R) Xeon(R) Gold 6248 CPU 2.50GHz, a memory of 400G, disk space of 1T, and 8 NVIDIA Tesla V100 32G GPUs (in the hierarchical approach, we only use 4 of them).
Software Dependencies No The Python interface for SC2LE is called PySC2. We use the 3.16.1 version of SC2, the first version supported by PySC2.
Experiment Setup Yes The γ value is 1. The λ in the generalized advantage estimation is set to 1. The clip value ϵ in the PPO is 0.1. For the coefficient of the value network, the c1 value is 0.01. For the coefficient of the entropy, the c2 value is set to 10^-5. The learning rate is set to 10^-4. The batch size of PPO is 64. We run the batch 20 epochs in each update of PPO. In the final 3-layer hierarchical architecture, most values of these parameters have been optimized. The number of episodes in one update is 1000. The γ value is 0.9995. The λ in the generalized advantage estimation is set to 0.9995. The clip value ϵ in the PPO is 0.2. For the coefficient of the value network, the c1 value is 0.5. For the coefficient of the entropy, the c2 value is set to 10^-3. The batch size of PPO is 512.
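The two quoted hyperparameter sets can be collected into configuration dicts, together with the standard PPO clipped surrogate objective those ϵ values control. This is a sketch for clarity, not the authors' implementation (their code lives at https://github.com/liuruoze/HierNet-SC2); the dict keys and the helper function are illustrative names:

```python
# Initial PPO setup, as quoted above.
ppo_initial = {
    "gamma": 1.0,           # discount factor γ
    "gae_lambda": 1.0,      # GAE λ
    "clip_epsilon": 0.1,    # PPO clip value ϵ
    "value_coef": 0.01,     # c1, value-network loss coefficient
    "entropy_coef": 1e-5,   # c2, entropy coefficient
    "learning_rate": 1e-4,
    "batch_size": 64,
    "epochs_per_update": 20,
}

# Optimized values for the final 3-layer hierarchical architecture.
ppo_final_3layer = {
    **ppo_initial,
    "gamma": 0.9995,
    "gae_lambda": 0.9995,
    "clip_epsilon": 0.2,
    "value_coef": 0.5,
    "entropy_coef": 1e-3,
    "batch_size": 512,
    "episodes_per_update": 1000,
}


def clipped_surrogate(ratio, advantage, eps):
    """PPO clipped objective for one sample: min(r*A, clip(r, 1-ϵ, 1+ϵ)*A)."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With ϵ = 0.2, a probability ratio of 1.5 on a positive advantage is clipped to 1.2, so the larger clip value in the final setup permits bigger policy updates per step than the initial ϵ = 0.1.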