Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Evaluating LLMs in Open-Source Games

Authors: Swadesh Sistla, Max Kleiman-Weiner

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate the capabilities of leading open- and closed-weight LLMs to predict and classify program strategies and evaluate features of the approximate program equilibria reached by LLM agents in dyadic and evolutionary settings. We make the following contributions: 1. LLMs can understand strategic code: In Section 4, we introduce SPARC, a strategic code classification benchmark that evaluates LLMs' ability to understand and predict the cooperative behavior of >230 human-written programs for the Iterated Prisoner's Dilemma (IPD). Benchmarking current state-of-the-art LLMs, open-weight models and reasoning models on SPARC shows robust game-theoretic code reasoning (top models >85%). |
| Researcher Affiliation | Academia | Swadesh Sistla University of Washington EMAIL Max Kleiman-Weiner University of Washington EMAIL |
| Pseudocode | No | The paper includes Python code examples under 'Program Implementation' and 'Skeleton for the developer' in Appendix A.3, but these are not labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code to reproduce our results is available at: https://github.com/swadeshs/llm-osgt |
| Open Datasets | Yes | The SPARC dataset is comprised of 239 strategies sourced from the Axelrod Python library [25]. |
| Dataset Splits | No | The SPARC dataset is comprised of 239 strategies sourced from the Axelrod Python library [25]. This library serves as a high-quality repository for IPD research, covering a diversity of algorithmic approaches. The paper describes using the entire dataset for evaluation rather than splitting it for distinct training, validation, and testing phases for the LLMs under test. |
| Hardware Specification | No | Experiments were conducted using commercially available LLM APIs. These were accessed with standard API providers (primarily through Hugging Face and the OpenAI API), and cost less than $50 across experiments. The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) as the experiments were conducted via LLM APIs. |
| Software Dependencies | No | The paper mentions software like the Axelrod Python library, the Radon library for Python code analysis, and the Carbon obfuscator, but does not provide specific version numbers for these or other key software components used in their methodology. |
| Experiment Setup | Yes | A.5.2 Experimental Settings details the configurations for each experiment: SPARC Benchmark includes Prompting (Zero-Shot and CoT) and IPD Rounds (r=10); Program Games specify LLM (Kimi-K2), Number of Seeds (10), Meta-Rounds (10), and IPD Rounds per Meta-Round (10); Evolutionary Dynamics lists IPD Rounds for Payoff Matrix (50-shot) and Replicator Dynamics. Additionally, A.5.3 states 'Temperature was set to 0.7, and tokens were limited to a maximum of 3500 per query'. |
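
To make the "IPD Rounds (r=10)" setting under Experiment Setup concrete, the following is a minimal sketch of a 10-round Iterated Prisoner's Dilemma between two programs. It is an illustration only and not the paper's code: the payoff values (3, 0, 5, 1) are the conventional IPD defaults and are assumed here, the two strategies are hypothetical stand-ins for the Axelrod-library strategies, and the function names are invented for this example.

```python
from typing import Callable, List, Tuple

Move = str  # "C" (cooperate) or "D" (defect)
Strategy = Callable[[List[Move], List[Move]], Move]

# Conventional IPD payoffs (ASSUMED; the summary above does not state them).
# Key: (move of player A, move of player B) -> (payoff A, payoff B)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own: List[Move], opp: List[Move]) -> Move:
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else "C"


def always_defect(own: List[Move], opp: List[Move]) -> Move:
    """Defect unconditionally."""
    return "D"


def play_ipd(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Play `rounds` rounds and return the cumulative scores (A, B)."""
    hist_a: List[Move] = []
    hist_b: List[Move] = []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy sees its own history first, then the opponent's.
        move_a = a(hist_a, hist_b)
        move_b = b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play_ipd(tit_for_tat, always_defect, rounds=10))
    print(play_ipd(tit_for_tat, tit_for_tat, rounds=10))
```

Under these assumed payoffs, tit-for-tat loses only the first round to an unconditional defector and then matches it, while two tit-for-tat players cooperate throughout. The paper's own pipeline, including LLM-generated programs, meta-rounds, and replicator dynamics, is available in the linked repository.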