How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Authors: Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study how well LLMs can negotiate with each other. We develop NEGOTIATIONARENA: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NEGOTIATIONARENA to assess LLMs' behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations).
Researcher Affiliation | Collaboration | 1 Stanford University, Stanford, California; 2 Independent; 3 Bauplan, New York, New York.
Pseudocode | No | The paper describes the system implementation and communication format using XML-like tags, but it does not include any formal pseudocode or algorithm blocks. (An illustrative sketch of such a tagged message format appears after the table.)
Open Source Code | Yes | Our contributions: We propose NEGOTIATIONARENA: an open-source framework to evaluate and probe the negotiation abilities of LLM agents. NEGOTIATIONARENA is available at https://github.com/vinid/NegotiationArena.
Open Datasets | No | The paper defines and implements custom negotiation scenarios rather than using a pre-existing publicly available dataset. It describes how interactions are generated within its framework for evaluation.
Dataset Splits | No | The paper specifies the number of negotiations run per pair of agents ("We run 60 negotiations for each ordered pair of agents in each scenario."), but it does not report traditional dataset splits (e.g., train/validation/test percentages or sample counts).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions that NEGOTIATIONARENA is "implemented in Python," but it does not provide specific version numbers for Python or any other libraries or frameworks used.
Experiment Setup | Yes | We run 60 negotiations for each ordered pair of agents in each scenario. Both GPT and Claude are using a temperature of 0.7 and they can generate a response of a maximum of 400 tokens. We add behavioral prompts to the system prompt of each game. (A sketch of this setup follows the table.)
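
The communication format noted in the Pseudocode row uses XML-like tags to separate free-form messages from structured game moves. The Python sketch below is purely illustrative: the tag names <message> and <proposal> and the offer text are assumptions, not the paper's actual schema, but they show how a tagged negotiation turn can be emitted and parsed.

    import re

    # Hypothetical example of an XML-tag-formatted negotiation turn.
    # The tag names are illustrative placeholders, not the paper's exact schema.
    turn = (
        "<message> I can offer you 6 coins for 4 of your gems. </message>\n"
        "<proposal> RED gives 6 coins; BLUE gives 4 gems. </proposal>"
    )

    def extract_tag(text, tag):
        """Return the content of the first <tag>...</tag> block, or '' if absent."""
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else ""

    print(extract_tag(turn, "message"))   # free-form natural-language message
    print(extract_tag(turn, "proposal"))  # structured move the game engine can parse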
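
The Experiment Setup row gives the reported run parameters: 60 negotiations for each ordered pair of agents, temperature 0.7, a 400-token generation cap, and behavioral prompts appended to the system prompt. The following is a minimal sketch of how such a run could be orchestrated; it is not NegotiationArena's actual API, and the model names, prompt text, and helper function are assumptions for illustration (shown with the OpenAI SDK; the Claude agents would use the anthropic SDK analogously).

    from itertools import permutations
    from openai import OpenAI  # Claude agents would use the anthropic SDK instead

    client = OpenAI()

    AGENTS = ["gpt-4", "gpt-3.5-turbo"]          # illustrative model names, not the paper's exact list
    NEGOTIATIONS_PER_PAIR = 60                    # "60 negotiations for each ordered pair of agents"
    BEHAVIOR_PROMPT = "You should act desperate." # hypothetical behavioral prompt added to the system prompt

    def agent_reply(model, system_prompt, history):
        """One negotiation turn with the sampling settings reported in the paper."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt + "\n" + BEHAVIOR_PROMPT},
                {"role": "user", "content": history},
            ],
            temperature=0.7,   # "a temperature of 0.7"
            max_tokens=400,    # "a maximum of 400 tokens"
        )
        return response.choices[0].message.content

    # Ordered pairs matter because which agent moves first can change the outcome.
    for agent_a, agent_b in permutations(AGENTS, 2):
        for run in range(NEGOTIATIONS_PER_PAIR):
            ...  # alternate agent_reply(agent_a, ...) and agent_reply(agent_b, ...) until the game ends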