GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations
Authors: Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBENCH, a language-driven environment composing 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we ➊ characterize the game-theoretic reasoning of LLMs and ➋ perform LLM-vs.-LLM competitions as reasoning evaluation. |
| Researcher Affiliation | Collaboration | Jinhao Duan (Drexel University), Renming Zhang (Boston University), James Diffenderfer (LLNL), Bhavya Kailkhura (LLNL), Lichao Sun (Lehigh University), Elias Stengel-Eskin (UNC Chapel Hill), Mohit Bansal (UNC Chapel Hill), Tianlong Chen (UNC Chapel Hill, MIT, Harvard University), Kaidi Xu (Drexel University) |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes processes and structures but not in a pseudocode format. |
| Open Source Code | Yes | The code and leaderboard will be public and continuously updated for future reasoning agents and LLMs. |
| Open Datasets | No | The paper evaluates existing Large Language Models (LLMs) in competitive game-theoretic environments rather than training a new model on a specific publicly available dataset. While the game environments themselves are well known, the authors do not present or release a specific collection of data points for training as a dataset. |
| Dataset Splits | No | The paper does not describe explicit training, validation, or test dataset splits for model development, as it primarily evaluates pre-trained LLMs in game simulations. It mentions running '50 valid matches' for each competition, but this is for evaluation, not a dataset split for model training/validation. |
| Hardware Specification | No | The paper mentions that the results are obtained from 'endpoint API providers, e.g., OpenAI' (Section 4), which implies the specific hardware used is not disclosed or directly controlled by the authors for these experiments. |
| Software Dependencies | No | The paper mentions: 'In this paper, all of the gaming environments are built on top of OpenSpiel (Lanctot et al., 2019)'. However, it does not provide version numbers for OpenSpiel or for any other key software dependencies. (A minimal environment-loading sketch is given after the table.) |
| Experiment Setup | Yes | Experimental Settings. We consider well-recognized LLMs such as commercial LLMs: GPT-3.5-turbo-1106 and GPT-4-0613 (Achiam et al., 2023), and open-source LLMs: Llama-3-70b Instruct (Meta, 2024), Deepseek-LLM-67b-chat (Bi et al., 2024), Llama-2-70b-chat (Touvron et al., 2023), Code Llama (Roziere et al., 2023), and Mistral-7b-Orca (Jiang et al., 2023a; Mukherjee et al., 2023). For all the LLMs, the temperature is set to 0.2 and the max number of generated tokens is 1024. For each competition, we run 50 valid matches. The final performance is measured by the averaged NRA over the 50 valid matches. To mitigate the first-player advantage, we have each participant take the first turn in 25 matches. |
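As a reference for the Software Dependencies row, below is a minimal, illustrative sketch of how a GTBench-style game environment can be driven through OpenSpiel's Python bindings (`pyspiel`). The specific game name and the trivial action-selection policy are assumptions for illustration only; the paper's actual agents and its exact set of ten games are defined in the GTBench codebase, not here.

```python
# Illustrative only: load a game registered in OpenSpiel and play it to a
# terminal state with a trivial policy. GTBench's LLM agents would replace
# the action-selection line with a language-driven choice.
import pyspiel

game = pyspiel.load_game("tic_tac_toe")  # assumed example of a complete-information, deterministic game
state = game.new_initial_state()

while not state.is_terminal():
    legal_actions = state.legal_actions()
    # An LLM agent would be prompted with the current observation and pick from
    # legal_actions; here we take the first legal action to keep the sketch short.
    state.apply_action(legal_actions[0])

print("Returns (player 0, player 1):", state.returns())
```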
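Similarly, the evaluation protocol quoted in the Experiment Setup row (temperature 0.2, at most 1024 generated tokens, 50 valid matches per competition with each participant taking the first turn in 25 of them, and the final score reported as the averaged NRA, i.e., Normalized Relative Advantage) can be outlined as follows. This is a hedged sketch: `play_match`, `run_competition`, and `GENERATION_CONFIG` are hypothetical names, and the actual prompting, match-validity checks, and NRA computation are those of the paper's released code.

```python
# Sketch of the match protocol reported in the paper; play_match() is a
# hypothetical stand-in for one full LLM-vs-LLM game returning agent_a's NRA.

GENERATION_CONFIG = {"temperature": 0.2, "max_tokens": 1024}  # settings reported in the paper
NUM_MATCHES = 50        # valid matches per competition
FIRST_TURN_SPLIT = 25   # each participant takes the first turn in 25 matches

def play_match(agent_a, agent_b, a_moves_first, generation_config):
    """Hypothetical helper: play one valid match and return agent_a's NRA for it."""
    return 0.0  # placeholder; a real match queries both LLMs turn by turn in the game environment

def run_competition(agent_a, agent_b):
    """Average the per-match NRA over the 50 valid matches, alternating who moves first."""
    nra_scores = []
    for match_idx in range(NUM_MATCHES):
        a_moves_first = match_idx < FIRST_TURN_SPLIT  # mitigates first-player advantage
        nra_scores.append(play_match(agent_a, agent_b, a_moves_first, GENERATION_CONFIG))
    return sum(nra_scores) / len(nra_scores)
```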