Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Authors: Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation."
Researcher Affiliation | Academia | "Jorge Carrasco Pollo (EMAIL), Informatics Institute, University of Amsterdam; Ioannis Kapetangeorgis (EMAIL), Informatics Institute, University of Amsterdam; Joshua Rosenthal (EMAIL), Informatics Institute, University of Amsterdam; John Hua Yao (EMAIL), Informatics Institute, University of Amsterdam"
Pseudocode | No | "The paper describes methods and procedures in paragraph text and tables, but does not include any clearly labeled pseudocode or algorithm blocks with structured steps."
Open Source Code | Yes | "Our implementation: https://github.com/joshrosie/FACT29"
Open Datasets | Yes | "In response, Abdelnabi et al. (2024) introduce a benchmark grounded in Scoreable Games (Susskind, 1985), a round-based multi-agent discussion format designed to teach negotiation. This proposed framework evaluates... 5 pre-configured games are provided by the authors of the original paper."
Dataset Splits | No | "This paper focuses on a reproducibility study of a negotiation benchmark, which involves running simulation games. It evaluates performance over a number of 'game iterations' (e.g., 20 game iterations) rather than traditional training/validation/test splits of a dataset."
Hardware Specification | Yes | "The experiments were conducted on local machines equipped with NVIDIA A100 and A6000 GPUs, with memory capacities of 40 GB and 48 GB, respectively."
Software Dependencies | No | "The experiments were conducted on Linux-based systems. For OpenAI models (i.e., GPT-4o and GPT-4o mini), the OpenAI API was utilized, whereas for open-source models, the Hugging Face transformers library was employed. Additionally, the codecarbon library was used to monitor carbon emissions during the computational processes."
Experiment Setup | Yes | "The negotiation begins with a predefined deal from p1 (the original paper provides a default deal for each game). Afterward, players propose deals in a randomized sequence for R rounds (R = 24 by default), and after the last round, p1 proposes a final deal. ... We run the ablation study on more than one model, namely GPT-4o mini and Qwen2.5-72B. ... We detail our full ablation configurations in Table 2."
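The round-based protocol quoted above can be sketched as follows. This is a minimal sketch, not the benchmark's implementation: `run_negotiation` and `propose` are hypothetical names (`propose` stands in for a model call), and choosing one random speaker per round is an assumption, since the quote only says proposals follow a randomized sequence for R rounds.

```python
import random

def run_negotiation(players, propose, rounds=24, seed=0):
    """Sketch of the quoted protocol (hypothetical helper names):
    p1 opens with the predefined deal, one randomly chosen player
    proposes per round (an assumption), and p1 proposes the final
    deal after the last round (R = 24 by default)."""
    rng = random.Random(seed)
    history = [("p1", propose("p1", []))]           # predefined opening deal from p1
    for _ in range(rounds):                         # R rounds of randomized proposals
        speaker = rng.choice(players)
        history.append((speaker, propose(speaker, history)))
    history.append(("p1", propose("p1", history)))  # p1's final deal
    return history
```

Under these assumptions, the transcript with the default R = 24 contains the opening deal, 24 in-round proposals, and p1's final deal, i.e. 26 entries in total.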