How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Authors: Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study how well LLMs can negotiate with each other. We develop NEGOTIATIONARENA: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NEGOTIATIONARENA to assess LLMs' behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations).
Researcher Affiliation | Collaboration | 1 Stanford University, Stanford, California; 2 Independent; 3 Bauplan, New York, New York.
Pseudocode | No | The paper describes the system implementation and communication format using XML-like tags, but it does not include any formal pseudocode or algorithm blocks. (An illustrative sketch of such a tagged message format appears after the table.)
Open Source Code | Yes | Our contributions: We propose NEGOTIATIONARENA: an open-source framework to evaluate and probe the negotiation abilities of LLM agents. NEGOTIATIONARENA is available at https://github.com/vinid/NegotiationArena.
Open Datasets | No | The paper defines and implements custom negotiation scenarios rather than using a pre-existing publicly available dataset. It describes how interactions are generated within its framework for evaluation.
Dataset Splits | No | The paper specifies the number of negotiations run per pair of agents ("We run 60 negotiations for each ordered pair of agents in each scenario."), but it does not report traditional dataset splits (e.g., train/validation/test percentages or sample counts).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions that NEGOTIATIONARENA is "implemented in Python," but it does not provide specific version numbers for Python or any other libraries or frameworks used.
Experiment Setup | Yes | We run 60 negotiations for each ordered pair of agents in each scenario. Both GPT and Claude are using a temperature of 0.7 and they can generate a response of a maximum of 400 tokens. We add behavioral prompts to the system prompt of each game. (A sketch of this setup follows the table.)
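
The communication format noted in the Pseudocode row uses XML-like tags to separate free-form messages from structured game moves. The Python sketch below is purely illustrative: the tag names <message> and <proposal> and the offer text are assumptions, not the paper's actual schema, but they show how a tagged negotiation turn can be emitted and parsed.

    import re

    # Hypothetical example of an XML-tag-formatted negotiation turn.
    # The tag names are illustrative placeholders, not the paper's exact schema.
    turn = (
        "<message> I can offer you 6 coins for 4 of your gems. </message>\n"
        "<proposal> RED gives 6 coins; BLUE gives 4 gems. </proposal>"
    )

    def extract_tag(text, tag):
        """Return the content of the first <tag>...</tag> block, or '' if absent."""
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else ""

    print(extract_tag(turn, "message"))   # free-form natural-language message
    print(extract_tag(turn, "proposal"))  # structured move the game engine can parse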
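
The Experiment Setup row gives the reported run parameters: 60 negotiations for each ordered pair of agents, temperature 0.7, a 400-token generation cap, and behavioral prompts appended to the system prompt. The following is a minimal sketch of how such a run could be orchestrated; it is not NegotiationArena's actual API, and the model names, prompt text, and helper function are assumptions for illustration (shown with the OpenAI SDK; the Claude agents would use the anthropic SDK analogously).

    from itertools import permutations
    from openai import OpenAI  # Claude agents would use the anthropic SDK instead

    client = OpenAI()

    AGENTS = ["gpt-4", "gpt-3.5-turbo"]          # illustrative model names, not the paper's exact list
    NEGOTIATIONS_PER_PAIR = 60                    # "60 negotiations for each ordered pair of agents"
    BEHAVIOR_PROMPT = "You should act desperate." # hypothetical behavioral prompt added to the system prompt

    def agent_reply(model, system_prompt, history):
        """One negotiation turn with the sampling settings reported in the paper."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt + "\n" + BEHAVIOR_PROMPT},
                {"role": "user", "content": history},
            ],
            temperature=0.7,   # "a temperature of 0.7"
            max_tokens=400,    # "a maximum of 400 tokens"
        )
        return response.choices[0].message.content

    # Ordered pairs matter because which agent moves first can change the outcome.
    for agent_a, agent_b in permutations(AGENTS, 2):
        for run in range(NEGOTIATIONS_PER_PAIR):
            ...  # alternate agent_reply(agent_a, ...) and agent_reply(agent_b, ...) until the game ends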