A Robust and Opponent-Aware League Training Method for StarCraft II

Authors: Ruozi Huang, Xipeng Wu, Hongsheng Yu, Zhong Fan, Haobo Fu, Qiang Fu, Wei Yang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our improvements by comparing ROA-Star to AlphaStar. Extensive experiments demonstrate that the exploiters in ROA-Star are more effective in detecting the weaknesses of the main agent and the entire league; that the main agent in ROA-Star responds to the opponent's strategy more effectively; and that, overall, the main agent in ROA-Star is significantly stronger. We also conducted by far the most comprehensive AI vs. top human evaluations in StarCraft II, where our agent trained by ROA-Star achieved a winning rate above 50% in repeated games. A detailed comparison between AlphaStar and ROA-Star, in terms of computational cost and human evaluation, is given in Table 1.
Researcher Affiliation | Industry | Tencent AI Lab, Shenzhen, China. {rosiehuang,haroldwu,yannickyu,zhongfan,haobofu,leonfu,willyang}@tencent.com
Pseudocode | No | Figure 1 displays architectural diagrams and flowcharts, not pseudocode. No other sections or figures are labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of their methodology. The link provided in Section A.1 refers to human replay data, not the authors' implementation code.
Open Datasets | Yes | Blizzard is releasing a large number of 1v1 replays played on the ladder. The instructions for how to download the replay files can be found at https://github.com/Blizzard/s2client-proto. We extracted a dataset from these replays which contains 120,938 Protoss vs. Protoss replays from StarCraft II versions 4.8.2 to 4.9.3. These replays were played by human players with MMR scores greater than 4100. A code sketch of this filtering is given after this table.
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits for the main reinforcement learning process. While it mentions a 'test set of 3000 human replays' for the opponent prediction model in Appendix C.4, this is not a comprehensive split for the overall experimental reproduction.
Hardware Specification | Yes | Table 1 (computational cost): TPU or GPU: 256 3rd-generation TPU cores (AlphaStar) vs. 64 NVIDIA V100 GPUs (ROA-Star); CPU: 4100 preemptible CPU cores (AlphaStar) vs. 4600 standard CPU cores (ROA-Star). For each agent, the full scale of computational resources contains 64 NVIDIA V100 GPUs and 4600 CPU cores.
Software Dependencies | No | The paper mentions algorithms like V-trace and UPGO, but does not specify software dependencies with version numbers (e.g., specific deep learning frameworks and their versions).
Experiment Setup | Yes | The training procedure of ROA-Star includes a supervised learning stage and a 50-day multi-agent reinforcement learning stage. ... MA is trained with strategy statistic z sampled from our strategy set D, and we set z to zero 10% of the time. A frozen copy of MA is added as a new player to the league every 2 × 10^8 steps. The LE fights with the whole league and adds a frozen copy to the league when it defeats all players in the league with a win rate above 70% or reaches the timeout threshold of 2 × 10^8 steps; at this point, its parameters are reset with a 25% probability. ME aims to find the weaknesses of MA; it adds a frozen copy to the league and resets its parameters when defeating MA in more than 70% of games or after a timeout of 4 × 10^8 steps. ... Each exploiter is reset to one of several configurations in the proportion of 20% original unconditional exploiter, 30% EIE, and 50% ERE. EIE resets to the current MA model and samples from the z set with the top 10% win rate. ERE resets to the supervised model and conditions on the top 15% of z in execution deviation. ... The average APM of any of our final agent models is less than 240 (with a peak APM below 800). ... Our MA was continuously trained for 50 days, which consumed about 4.42 × 10^10 steps. A configuration sketch of these league hyperparameters follows this table.
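The dataset criteria quoted in the Open Datasets row are concrete enough to sketch in code. The snippet below is a minimal illustration, not the authors' pipeline: ReplayInfo and its metadata fields are hypothetical stand-ins for whatever a replay parser built on s2client-proto actually exposes; only the filter thresholds (Protoss vs. Protoss, MMR > 4100, StarCraft II versions 4.8.2 through 4.9.3) come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ReplayInfo:
    # Hypothetical metadata record; field names are illustrative only.
    path: str
    game_version: str            # e.g. "4.8.2"
    races: Tuple[str, str]       # e.g. ("Protoss", "Protoss")
    mmrs: Tuple[int, int]        # ladder MMR of the two players


MIN_VERSION = (4, 8, 2)
MAX_VERSION = (4, 9, 3)
MIN_MMR = 4100


def version_tuple(v: str) -> Tuple[int, ...]:
    return tuple(int(x) for x in v.split("."))


def select_replays(infos: Iterable[ReplayInfo]) -> List[ReplayInfo]:
    """Keep Protoss-vs-Protoss ladder replays with both players above 4100 MMR,
    restricted to StarCraft II versions 4.8.2 through 4.9.3 (criteria from the paper)."""
    kept = []
    for info in infos:
        if info.races != ("Protoss", "Protoss"):
            continue
        if min(info.mmrs) <= MIN_MMR:
            continue
        if not (MIN_VERSION <= version_tuple(info.game_version) <= MAX_VERSION):
            continue
        kept.append(info)
    return kept
```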
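The league hyperparameters listed in the Experiment Setup row can be collected into a single configuration sketch. The following is an assumption-heavy illustration of how the snapshot intervals, timeouts, and exploiter-reset mixture described above might be encoded; the class and function names are hypothetical and are not taken from the authors' code.

```python
import random
from dataclasses import dataclass, field
from typing import Dict

STEPS = 10 ** 8  # convenience unit: 1e8 environment steps


@dataclass
class LeagueConfig:
    # Main agent (MA)
    ma_zero_z_prob: float = 0.10            # train with z = 0 in 10% of games
    ma_snapshot_interval: int = 2 * STEPS   # frozen MA copy added every 2 x 10^8 steps
    # League exploiter (LE)
    le_win_rate_to_snapshot: float = 0.70   # beat the whole league at >70% to snapshot
    le_timeout_steps: int = 2 * STEPS
    le_reset_prob: float = 0.25             # parameters reset with 25% probability
    # Main exploiter (ME)
    me_win_rate_to_snapshot: float = 0.70   # beat MA in >70% of games to snapshot
    me_timeout_steps: int = 4 * STEPS
    # Exploiter reset mixture (proportions quoted in the paper)
    reset_mixture: Dict[str, float] = field(default_factory=lambda: {
        "unconditional": 0.20,  # original unconditional exploiter
        "EIE": 0.30,            # resets to current MA; samples top-10% win-rate z
        "ERE": 0.50,            # resets to supervised model; top-15% z by execution deviation
    })


def sample_exploiter_reset(cfg: LeagueConfig) -> str:
    """Draw which exploiter configuration to use at reset time."""
    kinds, weights = zip(*cfg.reset_mixture.items())
    return random.choices(list(kinds), weights=list(weights), k=1)[0]
```

Usage: `sample_exploiter_reset(LeagueConfig())` returns "unconditional", "EIE", or "ERE" with the quoted 20/30/50 proportions.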