Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Authors: Jorge Carrasco Pollo, Ioannis Kapetangeorgis, Joshua Rosenthal, John Hua Yao

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation."
Researcher Affiliation | Academia | "Jorge Carrasco Pollo (EMAIL), Informatics Institute, University of Amsterdam; Ioannis Kapetangeorgis (EMAIL), Informatics Institute, University of Amsterdam; Joshua Rosenthal (EMAIL), Informatics Institute, University of Amsterdam; John Hua Yao (EMAIL), Informatics Institute, University of Amsterdam"
Pseudocode | No | "The paper describes methods and procedures in paragraph text and tables, but does not include any clearly labeled pseudocode or algorithm blocks with structured steps."
Open Source Code | Yes | "Our implementation: https://github.com/joshrosie/FACT29"
Open Datasets | Yes | "In response, Abdelnabi et al. (2024) introduce a benchmark grounded in Scoreable Games (Susskind, 1985), a round-based multi-agent discussion format designed to teach negotiation. This proposed framework evaluates... 5 pre-configured games are provided by the authors of the original paper."
Dataset Splits | No | "This paper focuses on a reproducibility study of a negotiation benchmark, which involves running simulation games. It evaluates performance over a number of 'game iterations' (e.g., 20 game iterations) rather than traditional training/validation/test splits of a dataset."
Hardware Specification | Yes | "The experiments were conducted on local machines equipped with NVIDIA A100 and A6000 GPUs, with memory capacities of 40 GB and 48 GB, respectively."
Software Dependencies | No | "The experiments were conducted on Linux-based systems. For OpenAI models (i.e., GPT-4o and GPT-4o mini), the OpenAI API was utilized, whereas for open-source models, the Hugging Face transformers library was employed. Additionally, the codecarbon library was used to monitor carbon emissions during the computational processes."
Experiment Setup | Yes | "The negotiation begins with a predefined deal from p1 (the original paper provides a default deal for each game). Afterward, players propose deals in a randomized sequence for R rounds (R = 24 by default), and after the last round, p1 proposes a final deal. ... We run the ablation study on more than one model, namely GPT-4o mini and Qwen2.5-72B. ... We detail our full ablation configurations in Table 2."
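The round-based protocol quoted above can be sketched as follows. This is a minimal sketch, not the benchmark's implementation: `run_negotiation` and `propose` are hypothetical names (`propose` stands in for a model call), and choosing one random speaker per round is an assumption, since the quote only says proposals follow a randomized sequence for R rounds.

```python
import random

def run_negotiation(players, propose, rounds=24, seed=0):
    """Sketch of the quoted protocol (hypothetical helper names):
    p1 opens with the predefined deal, one randomly chosen player
    proposes per round (an assumption), and p1 proposes the final
    deal after the last round (R = 24 by default)."""
    rng = random.Random(seed)
    history = [("p1", propose("p1", []))]           # predefined opening deal from p1
    for _ in range(rounds):                         # R rounds of randomized proposals
        speaker = rng.choice(players)
        history.append((speaker, propose(speaker, history)))
    history.append(("p1", propose("p1", history)))  # p1's final deal
    return history
```

Under these assumptions, the transcript with the default R = 24 contains the opening deal, 24 in-round proposals, and p1's final deal, i.e. 26 entries in total.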