Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context
Authors: Jingru (Jessica) Jia, Zehua Yuan, Junhao Pan, Paul McNamara, Deming Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes a framework, grounded in behavioral economics theories, to evaluate the decision-making behaviors of LLMs. With a multiple-choice-list experiment, we initially estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-Pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities, but there are significant variations in the degree to which these behaviors are expressed across different LLMs. |
| Researcher Affiliation | Academia | Jingru Jia*, Zehua Yuan*, Junhao Pan, Paul E. McNamara, and Deming Chen University of Illinois at Urbana-Champaign {jingruj3, zehuay2, jpan22, mcnamar1, dchen}@illinois.edu |
| Pseudocode | No | The paper describes the steps of its framework (Experimentation Design, Recording Switching Points, Setting Up Inequalities, Estimating Parameters, Behavior Evaluation) but does not include structured pseudocode or an algorithm block. (A hedged sketch of these steps appears after this table.) |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We release our code and data, with instructions on the exact commands and environment. |
| Open Datasets | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We release our code and data, with instructions on the exact commands and environment. |
| Dataset Splits | No | The paper conducts behavioral experiments on LLMs, which act as the subjects of the study, similar to human participants. It does not involve training a machine learning model on a dataset with traditional train/validation/test splits. Therefore, there are no validation dataset splits in the context of model training. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used to run the experiments or to interact with the LLMs. |
| Software Dependencies | No | The paper does not provide specific software dependency names with version numbers required for replication (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We implemented a data collection pipeline to conduct each experiment through API calls to ensure consistency. All three models were tested across two context settings: context-free and with embedded demographic features. Prompt templates were specifically designed to optimize for responsiveness and answer validity, with an example prompt for ChatGPT provided in Appendix C. We chose a sample size of 300 data pieces... the same session was maintained for each trial during data collection. History from the previous game sets was cleared to prevent LLMs from recollecting previous games. (A minimal sketch of such a pipeline follows this table.) |
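The framework steps listed in the Pseudocode row (recording switching points, setting up inequalities, estimating parameters) can be made concrete with a short estimation routine. The sketch below is an illustrative reconstruction, not the authors' released code: it assumes the standard Tversky-Kahneman prospect-theory functional forms (a power value function with loss-aversion coefficient lambda and an inverse-S probability-weighting function), and the function names (`weight`, `utility`, `pt_value`, `consistent`, `estimate`) are hypothetical.

```python
import numpy as np

def weight(p, gamma):
    """Tversky-Kahneman probability weighting: w(p) = p^g / (p^g + (1-p)^g)^(1/g)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def utility(x, sigma, lam):
    """Prospect-theory value function: power utility for gains,
    loss-averse power utility for losses."""
    return x**sigma if x >= 0 else -lam * (-x)**sigma

def pt_value(lottery, sigma, gamma, lam):
    """Weighted prospect-theory value of a lottery given as (probability, payoff) pairs."""
    return sum(weight(p, gamma) * utility(x, sigma, lam) for p, x in lottery)

def consistent(params, choice_rows):
    """Check whether (sigma, gamma, lam) reproduces the recorded choices,
    i.e. satisfies the inequalities implied by the switching point."""
    sigma, gamma, lam = params
    for lottery_a, lottery_b, chose_a in choice_rows:
        va = pt_value(lottery_a, sigma, gamma, lam)
        vb = pt_value(lottery_b, sigma, gamma, lam)
        if chose_a != (va >= vb):
            return False
    return True

def estimate(choice_rows,
             sigma_grid=np.linspace(0.2, 1.5, 27),
             gamma_grid=np.linspace(0.2, 1.5, 27),
             lam_grid=np.linspace(0.5, 4.0, 36)):
    """Grid-search the parameter region consistent with the recorded
    choices and return its centroid as a point estimate."""
    hits = [(s, g, l) for s in sigma_grid for g in gamma_grid for l in lam_grid
            if consistent((s, g, l), choice_rows)]
    return tuple(np.mean(hits, axis=0)) if hits else None

# Example: one price-list row where the LLM chose the safe option A.
row = ([(1.0, 10.0)],              # Option A: $10 for sure
       [(0.5, 25.0), (0.5, 0.0)],  # Option B: 50/50 chance of $25 or $0
       True)                       # recorded choice: A
print(estimate([row]))
```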
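On the data-collection side, the paper reports that each trial ran through an API call in a fresh session, with history from previous game sets cleared, and that 300 samples were collected. A minimal sketch under those constraints follows, using the openai v1 Python client; the model identifier, prompt text, and helper names are illustrative assumptions (the paper's actual prompt templates are in its Appendix C).

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template; the paper's actual templates are in Appendix C.
PROMPT_TEMPLATE = (
    "You are a participant in a decision-making experiment. "
    "For each row below, choose Option A or Option B and answer "
    "with the row number and your choice.\n\n{choice_list}"
)

def run_trial(choice_list: str, model: str = "gpt-4-turbo") -> str:
    """Run one independent trial. Each call sends a fresh message list,
    so the model cannot recall earlier game sets."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(choice_list=choice_list),
        }],
    )
    return response.choices[0].message.content

# The paper's stated sample size: 300 responses per model and setting.
choice_list = "1. A: $10 for sure  B: 50% chance of $25, else $0"
responses = [run_trial(choice_list) for _ in range(300)]
```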