Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory

Authors: Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Using this framework, we evaluate 22 state-of-the-art LLMs across diverse strategic scenarios. We find models like GPT-o3-mini, GPT-o1, and DeepSeek-R1 lead in reasoning depth. Through thinking chain analysis, we identify distinct reasoning styles such as maximin or belief-based strategies and show that longer reasoning chains do not consistently yield better decisions. Furthermore, embedding demographic personas reveals context-sensitive shifts: some models (e.g., GPT4o, Claude-3-Opus) improve when assigned female identities, while others (e.g., Gemini 2.0) show diminished reasoning under minority sexuality personas.
Researcher Affiliation	Academia	Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, and Deming Chen, University of Illinois at Urbana-Champaign EMAIL
Pseudocode	No	The paper describes the model specification and theoretical foundations using mathematical formulas in Section 2.2 and Appendix A, but it does not include explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Code for experiments is provided in the supplementary material.
Open Datasets	Yes	We also include the SW10 matrix from (54), a benchmark for identifying human reasoning levels. The complete payoff matrix tables are attached in the Appendix. While our experiments use abstract normal-form games $(N, \{A_i\}, \{u_i\})$, these payoffs directly mirror real-world interactions. [...] S-W 10 matrix is the matrix we borrow from the human research in behavioral economics in (54; 51).
Dataset Splits	No	The paper collects 30 independent trials for each LLM and each game to derive empirical choice frequencies for parameter estimation. However, it does not describe training, validation, or test dataset splits in the conventional sense, as it evaluates pre-trained LLMs rather than training new models with split datasets.
Hardware Specification	No	The experiments are lightweight and not resource-intensive. Reproduction can be achieved on standard hardware without specialized infrastructure.
Software Dependencies	No	The paper discusses interacting with LLMs via API or cloud services but does not provide specific software dependencies or version numbers (e.g., Python, PyTorch, etc.) used for running their analytical framework or experiments.
Experiment Setup	Yes	To quantify the average strategic reasoning depth ($\tau$) from observed behaviors, we proceed as follows: Step 1. Empirical Choice Frequencies. For each LLM and each game, we collect 30 independent trials and record the frequency $c_{ij}$ with which the model chose action $a_{ij}$. Step 2. Model-Implied Choice Probabilities. Under the bounded-rationality framework... Step 3. Maximum Likelihood Fit. We estimate $(\tau, \gamma)$ by maximizing the log-likelihood of the observed counts under the model: $\max_{\tau,\gamma} \sum_{i,j} c_{ij} \ln p_{ij}(\tau, \gamma)$. We develop a library of 13 matrix games spanning 7 core types from behavioral game theory, grouped into complete-information (fully known payoffs) and incomplete-information (need to reason with uncertainty) settings. The prompts used in the demographic-feature embedded experiment consist of two parts. The main body of the questions remains consistent with those in the vanilla prompt design. However, each prompt is augmented with a demographic component based on the template provided in Table 24. This demographic information is added at the beginning of each prompt to ensure that the LLM retains all features throughout the interaction, preventing memory loss during long-term conversations. The prompts used in the CoT embedding experiment are straightforward. Following the zero-shot CoT approach, a few additional sentences are appended to the end of each prompt to activate the CoT feature and enhance the reasoning capability of the model. Examples of the added CoT prompts are provided in Table 25.
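
The three estimation steps quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The excerpt elides the model-implied choice probabilities in Step 2, so `toy_model` below is a hypothetical placeholder (a uniform/logit mixture with made-up payoffs) that only stands in for whatever bounded-rationality model the paper defines. The grid search is likewise an assumption, since the excerpt does not say which optimizer was used. All function and variable names are invented for this sketch.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[float]]  # table[i][j]: value for action j in game i


def log_likelihood(counts: Table, probs: Table, eps: float = 1e-12) -> float:
    """Step 3 objective: sum over games i and actions j of c_ij * ln p_ij."""
    return sum(
        c * math.log(max(p, eps))  # eps guards against ln(0)
        for count_row, prob_row in zip(counts, probs)
        for c, p in zip(count_row, prob_row)
    )


def fit_grid(
    counts: Table,
    model: Callable[[float, float], Table],
    taus: Sequence[float],
    gammas: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (tau, gamma, log-likelihood) maximizing the objective on a grid."""
    best: Optional[Tuple[float, float, float]] = None
    for tau in taus:
        for gamma in gammas:
            ll = log_likelihood(counts, model(tau, gamma))
            if best is None or ll > best[2]:
                best = (tau, gamma, ll)
    if best is None:
        raise ValueError("taus and gammas must be non-empty")
    return best


# Hypothetical payoffs for two illustrative games (not taken from the paper).
TOY_UTILITIES: Table = [[3.0, 1.0], [2.0, 1.5, 0.0]]


def toy_model(tau: float, gamma: float) -> Table:
    """Placeholder for Step 2: NOT the paper's model.

    Mixes uniform play with a logit response of precision gamma;
    the logit weight 1 - exp(-tau) grows with tau.
    """
    weight = 1.0 - math.exp(-tau)
    table: Table = []
    for utilities in TOY_UTILITIES:
        exps = [math.exp(gamma * u) for u in utilities]
        total = sum(exps)
        n = len(utilities)
        table.append([(1.0 - weight) / n + weight * e / total for e in exps])
    return table


if __name__ == "__main__":
    # Step 1: counts from 30 trials per game (illustrative numbers only).
    observed: Table = [[24, 6], [17, 10, 3]]
    grid = [0.25 * k for k in range(1, 21)]
    tau_hat, gamma_hat, ll = fit_grid(observed, toy_model, grid, grid)
    print(f"tau={tau_hat:.2f} gamma={gamma_hat:.2f} loglik={ll:.3f}")
```

A grid search keeps the sketch dependency-free and easy to inspect. A gradient-based or simplex optimizer would be the usual choice once the real choice-probability model is plugged in for `toy_model`.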