Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context
Authors: Jingru (Jessica) Jia, Zehua Yuan, Junhao Pan, Paul McNamara, Deming Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes a framework, grounded in behavioral economics theories, to evaluate the decision-making behaviors of LLMs. With a multiple-choice-list experiment, we initially estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-Pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities, but there are significant variations in the degree to which these behaviors are expressed across different LLMs. |
| Researcher Affiliation | Academia | Jingru Jia*, Zehua Yuan*, Junhao Pan, Paul E. McNamara, and Deming Chen University of Illinois at Urbana-Champaign {jingruj3, zehuay2, jpan22, mcnamar1, dchen}@illinois.edu |
| Pseudocode | No | The paper describes the steps of its framework (Experimentation Design, Recording Switching Points, Setting Up Inequalities, Estimating Parameters, Behavior Evaluation) but does not include structured pseudocode or an algorithm block. (A hedged sketch of these steps appears after this table.) |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We release our code and data, with instructions on the exact commands and environment. |
| Open Datasets | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We release our code and data, with instructions on the exact commands and environment. |
| Dataset Splits | No | The paper conducts behavioral experiments on LLMs, which act as the subjects of the study, similar to human participants. It does not involve training a machine learning model on a dataset with traditional train/validation/test splits. Therefore, there are no validation dataset splits in the context of model training. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used to run the experiments or to interact with the LLMs. |
| Software Dependencies | No | The paper does not provide specific software dependency names with version numbers required for replication (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | We implemented a data collection pipeline to conduct each experiment through API calls to ensure consistency. All three models were tested across two context settings: context-free and with embedded demographic features. Prompt templates were specifically designed to optimize for responsiveness and answer validity, with an example prompt for ChatGPT provided in Appendix C. We chose a sample size of 300 data pieces... the same session was maintained for each trial during data collection. History from the previous game sets was cleared to prevent LLMs from recollecting previous games. (A minimal sketch of such a pipeline follows this table.) |
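The framework steps listed in the Pseudocode row (recording switching points, setting up inequalities, estimating parameters) can be made concrete with a short estimation routine. The sketch below is an illustrative reconstruction, not the authors' released code: it assumes the standard Tversky-Kahneman prospect-theory functional forms (a power value function with loss-aversion coefficient lambda and an inverse-S probability-weighting function), and the function names (`weight`, `utility`, `pt_value`, `consistent`, `estimate`) are hypothetical.

```python
import numpy as np

def weight(p, gamma):
    """Tversky-Kahneman probability weighting: w(p) = p^g / (p^g + (1-p)^g)^(1/g)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def utility(x, sigma, lam):
    """Prospect-theory value function: power utility for gains,
    loss-averse power utility for losses."""
    return x**sigma if x >= 0 else -lam * (-x)**sigma

def pt_value(lottery, sigma, gamma, lam):
    """Weighted prospect-theory value of a lottery given as (probability, payoff) pairs."""
    return sum(weight(p, gamma) * utility(x, sigma, lam) for p, x in lottery)

def consistent(params, choice_rows):
    """Check whether (sigma, gamma, lam) reproduces the recorded choices,
    i.e. satisfies the inequalities implied by the switching point."""
    sigma, gamma, lam = params
    for lottery_a, lottery_b, chose_a in choice_rows:
        va = pt_value(lottery_a, sigma, gamma, lam)
        vb = pt_value(lottery_b, sigma, gamma, lam)
        if chose_a != (va >= vb):
            return False
    return True

def estimate(choice_rows,
             sigma_grid=np.linspace(0.2, 1.5, 27),
             gamma_grid=np.linspace(0.2, 1.5, 27),
             lam_grid=np.linspace(0.5, 4.0, 36)):
    """Grid-search the parameter region consistent with the recorded
    choices and return its centroid as a point estimate."""
    hits = [(s, g, l) for s in sigma_grid for g in gamma_grid for l in lam_grid
            if consistent((s, g, l), choice_rows)]
    return tuple(np.mean(hits, axis=0)) if hits else None

# Example: one price-list row where the LLM chose the safe option A.
row = ([(1.0, 10.0)],              # Option A: $10 for sure
       [(0.5, 25.0), (0.5, 0.0)],  # Option B: 50/50 chance of $25 or $0
       True)                       # recorded choice: A
print(estimate([row]))
```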
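On the data-collection side, the paper reports that each trial ran through an API call in a fresh session, with history from previous game sets cleared, and that 300 samples were collected. A minimal sketch under those constraints follows, using the openai v1 Python client; the model identifier, prompt text, and helper names are illustrative assumptions (the paper's actual prompt templates are in its Appendix C).

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template; the paper's actual templates are in Appendix C.
PROMPT_TEMPLATE = (
    "You are a participant in a decision-making experiment. "
    "For each row below, choose Option A or Option B and answer "
    "with the row number and your choice.\n\n{choice_list}"
)

def run_trial(choice_list: str, model: str = "gpt-4-turbo") -> str:
    """Run one independent trial. Each call sends a fresh message list,
    so the model cannot recall earlier game sets."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(choice_list=choice_list),
        }],
    )
    return response.choices[0].message.content

# The paper's stated sample size: 300 responses per model and setting.
choice_list = "1. A: $10 for sure  B: 50% chance of $25, else $0"
responses = [run_trial(choice_list) for _ in range(300)]
```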