Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LLM-PySC2: Starcraft II learning environment for Large Language Models

Authors: Zongyuan Li, Yanan Ni, Runnan Qi, Chang Lu, Lumin Jiang, Xu Xiaojie, Xiangbei Liu, Pengfei Li, Yunzheng Guo, Zhe Ma, Huanyu Li, wu hui, Xian Guo, Kuihua Huang, Xuebo Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiments, we evaluated LLMs decision-making performance in both the macro-decision and micro-operation scenarios, with traditional StarCraft II Multi-Agent Challenge (SMAC) tasks and a series of new proposed. Results indicate that LLMs possess the potential to achieve victories in complex scenarios but cannot constantly generate correct decisions, especially in the recovered pysc2 action space and MA settings.
Researcher Affiliation | Academia | (1) College of Artificial Intelligence, Nankai University, Tianjin, China; (2) Laboratory for Big Data and Decision, National University of Defense, Changsha, China
Pseudocode | Yes | Appendix A. Pseudo Code. Algorithm 1: LLM-PySC2 Rollout Process. Algorithm 2: Query Process for an Agent.
Open Source Code | Yes | Our code is available in Anonymous GitHub (link: https://anonymous.4open.science/r/LLM-PySC2-Anonymous-0E0D), and the experiment results mentioned in the paper can be reproduced by source code.
Open Datasets | Yes | To provide support for LLM decision-making, we developed LLM-PySC2, an environment derived from the StarCraft II Learning Environment (SC2LE) (35). ... Unlike the SMAC (37) tasks, these tasks require more on task understanding and usage of unit skills.
Dataset Splits | No | In the macro-operation tasks (complete StarCraft II games), we conducted 30 repeated experiments from level-1 (very easy) to level-7 (very hard/elite). ... In the micro-operation tasks, we conducted 20 repeated experiments for each LLM (except GPT3.5-turbo which evaluates 50 games).
Hardware Specification | Yes | Table D1: System settings (Module: Recommended / Minimum requirements). System: Windows-10 or 11 / Windows-10. CPU: i9-14900, 24 cores 32 threads / 8 core. GPU: GeForce RTX 4090, 24G / GeForce GTX 1080. Storage: 64G RAM + 2T SSD / 8G RAM + 100G SSD.
Software Dependencies | Yes | Table D1: System settings. Starcraft II: Version 9.0.14(93333) / Version 9.0.14(93333).
Experiment Setup | Yes | In this section, we introduce two series of experiments: (1) Experiments for macro-decisions, i.e. complete StarCraft II game; (2) Experiments for micro-operations, including classic SMAC scenarios and eight new tasks that require units to use their skills and achieve assigned goal. ... In the macro-operation tasks (complete StarCraft II games), we conducted 30 repeated experiments from level-1 (very easy) to level-7 (very hard/elite). As shown in Table 1, two agents in the ECEB mode control the whole system via discrete actions and perform nearly the same as in TSC2 (6). At level-5, LLMs can only win about 30% of the games and nearly lose all games at level-6 and above.
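The experiment setup above reports win rates from a small number of repeated games (30 per difficulty level, with about 30% wins at level-5). As a rough illustration of how much uncertainty such a sample carries, the following sketch computes a Wilson score interval for a win rate. It is not part of the paper or its code: the function name `wilson_interval` is hypothetical, and the count of 9 wins out of 30 games is an assumed reading of "about 30%", not a figure from the paper.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a win rate.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half_width = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed example: 9 wins in 30 games, i.e. "about 30%" at level-5.
    low, high = wilson_interval(9, 30)
    print(f"win rate 9/30: 95% interval approx. [{low:.3f}, {high:.3f}]")
```

With only 30 games per level, the interval is wide, so small differences in reported win rates between models or settings may not be meaningful on their own.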