Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

Authors: Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, Haifeng Zhang

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiment, we detail the setup and key metrics (evaluation metrics detailed in Appendix A.2) to evaluate macro-strategic decision-making in StarCraft II. We assess the Chain of Summarization's impact on LLM gameplay, compare various LLMs' performance, and evaluate their grasp of StarCraft II strategies. Our experiments conclude with human-AI interaction tests.
Researcher Affiliation Academia Weiyu Ma1,2, Qirui Mi1,2, Yongcheng Zeng1,2, Xue Yan1,2, Yuqiao Wu1,2, Runji Lin1,2, Haifeng Zhang1,2,4, Jun Wang3. 1 Institute of Automation, Chinese Academy of Sciences, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 3 AI Centre, Department of Computer Science, UCL; 4 Nanjing Artificial Intelligence Research of IA, China
Pseudocode Yes The pseudocode is as shown in Algorithm 1.
Open Source Code Yes All code and data from this study have been made publicly available at https://github.com/histmeisah/Large-Language-Models-play-StarCraftII
Open Datasets Yes All code and data from this study have been made publicly available at https://github.com/histmeisah/Large-Language-Models-play-StarCraftII
Dataset Splits No The paper mentions 'Training Phase' and 'Testing Phase' and that 'we fine-tuned open-source models using the entire dataset of GPT3.5-turbo-16k interaction logs with TextStarCraft II', but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification Yes Training Phase: To fine-tune open-source Large Language Models, we utilized two NVIDIA A100 40GB GPUs... Testing Phase: The development and testing of these fine-tuned models were performed using two NVIDIA A100 40GB GPUs. For running the StarCraft II environment, we employed an NVIDIA GeForce RTX 3060 GPU paired with a 13th Gen Intel(R) Core(TM) i5-13400F processor operating at 2.50 GHz
Software Dependencies No The paper mentions software components like 'python-sc2 framework', 'GPT3.5-turbo-16k', 'Llama2 70B', 'ChatGLM3 6B', 'Qwen 1.8B', and 'sc2reader library', but does not provide specific version numbers for these dependencies.
Experiment Setup Yes Agent and Opponent Selection: To ensure a consistent and controlled experimental environment, LLM agents are configured to play as Protoss against Zerg AI opponents. ... Map Selection: For our experiments, we selected the Altitude LE and Ancient Cistern LE maps... Parameter Settings: The temperature parameter is fixed at 0.1 to focus on strategy-driven actions over randomness. Game Version: Our experiments were conducted across three different versions of the game to ensure robustness and applicability of the results. The versions tested include Patch 5.0.11, Patch 5.0.12, and Patch 5.0.13.
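The rows above reference the Chain of Summarization pipeline and the temperature fixed at 0.1. As a rough illustration of how raw game frames might be condensed into a short prompt before each LLM call, here is a minimal pure-Python sketch. All names (`Observation`, `summarize_frames`, `build_llm_request`) are hypothetical, not taken from the paper's codebase, and the payload mirrors a generic chat-completion request rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical per-frame macro state; field names are illustrative.
    time_s: int
    minerals: int
    gas: int
    supply_used: int
    supply_cap: int


def summarize_frames(frames: list[Observation]) -> str:
    """Compress a window of raw frames into one line of text,
    keeping only the latest state plus a coarse mineral-income trend."""
    latest = frames[-1]
    dt = max(latest.time_s - frames[0].time_s, 1)
    mineral_rate = (latest.minerals - frames[0].minerals) / dt
    return (
        f"t={latest.time_s}s: minerals={latest.minerals} "
        f"(+{mineral_rate:.0f}/s), gas={latest.gas}, "
        f"supply={latest.supply_used}/{latest.supply_cap}"
    )


def build_llm_request(summary: str, model: str = "gpt-3.5-turbo-16k") -> dict:
    """Generic chat-completion payload; temperature is pinned at 0.1,
    matching the paper's setting that favours strategy over randomness."""
    return {
        "model": model,
        "temperature": 0.1,
        "messages": [{"role": "user", "content": summary}],
    }
```

A decision step would then send `build_llm_request(summarize_frames(window))` to the model and parse the returned macro action; the summarization keeps the prompt short enough to fit many decision cycles into the model's context window.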