STEER: Assessing the Economic Rationality of Large Language Models
Authors: Narun Krishnamurthi Raman, Taylor Lundy, Samuel Joseph Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of British Columbia, Vancouver, Canada; 2 Tel Aviv University, Tel Aviv, Israel; 3 Stanford & AI21 Labs, Palo Alto, California, United States; 4 Technion & AI21 Labs, Tel Aviv, Israel. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured code-like steps for a procedure. |
| Open Source Code | Yes | We release all model outputs to support evaluation research and contributions, and provide a public website with all results (https://steer-benchmark.cs.ubc.ca) and underlying model prediction details, alongside an extensible codebase to support the community in taking SRCs further. |
| Open Datasets | Yes | We then propose a benchmark distribution called STEER (Systematic and Tuneable Evaluation of Economic Rationality) that quantitatively scores an LLM's performance on these elements... For 49 elements, we have written LLM prompts to synthetically generate 24,500 multiple-choice questions... We release all model outputs to support evaluation research and contributions, and provide a public website with all results (https://steer-benchmark.cs.ubc.ca) and underlying model prediction details, alongside an extensible codebase to support the community in taking SRCs further. |
| Dataset Splits | No | The paper describes a 'validation step' for the quality of generated questions, in which the authors 'randomly spot-checked 100 samples (i.e., 10% of all generated questions)'. This is not a train/validation/test split for training a model, as the LLMs are pre-trained and being evaluated. |
| Hardware Specification | Yes | We ran GPT 3.5 Turbo and 4 Turbo using OpenAI's API (OpenAI, 2020) and Azure OpenAI. We obtained 12 open-source models from the Hugging Face Hub (Wolf et al., 2019) and ran them on an A100 GPU on Compute Canada. |
| Software Dependencies | No | The paper mentions using 'OpenAI's API' and obtaining models from the 'Hugging Face Hub' but does not specify version numbers for any software dependencies like Python, specific libraries, or frameworks. |
| Experiment Setup | Yes | We decoded from all LLMs by sampling with temperature 0. We take two approaches to implement this idea, which we dub separate and together... For each question, we select n ∈ {0, 1, 2, 4, 5} examples (within the corresponding domain and grade level) to test the effect of prompting on a model's performance. |
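
To make the experiment-setup row concrete, below is a minimal sketch of a temperature-0, n-shot multiple-choice query of the kind the paper describes. It assumes the openai>=1.0 Python client; the model name, prompt format, and few-shot example structure are illustrative placeholders, not taken from the STEER codebase.

```python
# Minimal sketch: temperature-0 decoding with n-shot prompting (n in {0, 1, 2, 4, 5}).
# Assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment.
# The model name, prompt layout, and example fields below are illustrative only.
from openai import OpenAI

client = OpenAI()

def build_prompt(question, choices, examples):
    """Prepend n worked examples from the same domain/grade level to the target question."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nChoices: {', '.join(ex['choices'])}\nA: {ex['answer']}"
        for ex in examples
    )
    target = f"Q: {question}\nChoices: {', '.join(choices)}\nA:"
    return f"{shots}\n\n{target}" if shots else target

def ask(question, choices, examples=()):
    prompt = build_prompt(question, choices, examples)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,            # deterministic decoding, as reported in the paper
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()
```

The open-source models from the Hugging Face Hub could be queried analogously with the `transformers` library by generating with `do_sample=False`, which corresponds to the same temperature-0 (greedy) decoding.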