CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents

Authors: Siyuan Qi, Shuo Chen, Yexin Li, Xiangyu Kong, Junqi Wang, Bangcheng Yang, Pring Wong, Yifan Zhong, Xiaoyuan Zhang, Zhaowei Zhang, Nian Liu, Yaodong Yang, Song-Chun Zhu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To catalyze further research, we present initial results for both paradigms. The canonical RL-based agents exhibit reasonable performance in mini-games, whereas both RL- and LLM-based agents struggle to make substantial progress in the full game.
Researcher Affiliation | Collaboration | 1 National Key Laboratory of General Artificial Intelligence, BIGAI; 2 Peking University; 3 BUPT
Pseudocode | No | The paper describes the methods and network architectures in text and diagrams (e.g., Figure 5, Figure 11) but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | The code is available at https://github.com/bigai-ai/civrealm.
Open Datasets | No | The paper describes the creation of mini-game "instances" and the use of "maps" for full games, which function as environments or tasks for training and evaluation. However, it does not provide concrete access information (e.g., specific links, DOIs, or formal citations) for a publicly available, conventionally split dataset (train/validation/test) used for the experiments.
Dataset Splits | No | The paper describes training models for a certain number of steps and on different mini-game instances or maps, but it does not specify any explicit train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | The paper mentions parallelizing tensor environments with Ray but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using "Proximal Policy Optimization (PPO)" and "Ray" for tensor-based RL, and "GPT-3.5-turbo provided by Azure's OpenAI API" for LLM experiments. However, it does not provide specific version numbers for any of these software components as required for a reproducible description.
Experiment Setup | Yes | We configured the actor update for 5 epochs, employing a clipped value loss with a clip parameter of 0.2, and using one mini-batch per epoch. The coefficients assigned to the entropy term and value loss were 0.01 and 0.001, respectively. The length of each episode was set at 125 steps, and we collected training data across 8 parallel environments. The learning rate for the Adam [41] optimizer was established at 0.0005, with an optimizer epsilon of 0.00001.
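For reference, the PPO hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below is illustrative only: the variable and key names (e.g., ppo_config, n_rollout_envs) are assumptions rather than identifiers from the CivRealm codebase, and only the numerical values are taken from the paper.

```python
# Illustrative sketch of the reported PPO training hyperparameters.
# Key names are hypothetical; only the values come from the quoted setup.
ppo_config = {
    "ppo_epochs": 5,           # actor update epochs per iteration
    "clip_param": 0.2,         # clip parameter for the clipped value loss
    "num_mini_batch": 1,       # one mini-batch per epoch
    "entropy_coef": 0.01,      # coefficient of the entropy term
    "value_loss_coef": 0.001,  # coefficient of the value loss
    "episode_length": 125,     # steps per episode
    "n_rollout_envs": 8,       # parallel environments for data collection
    "lr": 5e-4,                # Adam learning rate
    "adam_eps": 1e-5,          # Adam optimizer epsilon
}

if __name__ == "__main__":
    # Print the configuration as a quick sanity check.
    for name, value in ppo_config.items():
        print(f"{name} = {value}")
```

A dictionary like this could be passed to whatever PPO trainer is used to reproduce the reported run; note that the paper does not specify hardware or software versions, so those would still need to be chosen independently.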