STEER: Assessing the Economic Rationality of Large Language Models
Authors: Narun Krishnamurthi Raman, Taylor Lundy, Samuel Joseph Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of British Columbia, Vancouver, Canada; 2 Tel Aviv University, Tel Aviv, Israel; 3 Stanford & AI21 Labs, Palo Alto, California, United States; 4 Technion & AI21 Labs, Tel Aviv, Israel. |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured code-like steps for a procedure. |
| Open Source Code | Yes | We release all model outputs to support evaluation research and contributions, and provide a public website with all results (https://steer-benchmark.cs.ubc.ca) and underlying model prediction details, alongside an extensible codebase to support the community in taking SRCs further. |
| Open Datasets | Yes | We then propose a benchmark distribution called STEER (Systematic and Tuneable Evaluation of Economic Rationality) that quantitatively scores an LLM's performance on these elements... For 49 elements, we have written LLM prompts to synthetically generate 24,500 multiple-choice questions... We release all model outputs to support evaluation research and contributions, and provide a public website with all results (https://steer-benchmark.cs.ubc.ca) and underlying model prediction details, alongside an extensible codebase to support the community in taking SRCs further. |
| Dataset Splits | No | The paper describes a 'validation step' for the quality of generated questions, in which the authors 'randomly spot-checked 100 samples (i.e., 10% of all generated questions)'. This is not a train/validation/test split for training a model, as the LLMs are pre-trained and being evaluated. |
| Hardware Specification | Yes | We ran GPT 3.5 Turbo and 4 Turbo using OpenAI's API (OpenAI, 2020) and Azure OpenAI. We obtained 12 open-source models from the Hugging Face Hub (Wolf et al., 2019) and ran them on an A100 GPU on Compute Canada. |
| Software Dependencies | No | The paper mentions using 'OpenAI's API' and obtaining models from the 'Hugging Face Hub' but does not specify version numbers for any software dependencies like Python, specific libraries, or frameworks. |
| Experiment Setup | Yes | We decoded from all LLMs by sampling with temperature 0. We take two approaches to implement this idea, which we dub separate and together... For each question, we select n ∈ {0, 1, 2, 4, 5} examples (within the corresponding domain and grade level) to test the effect of prompting on a model's performance. |
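
To make the experiment-setup row concrete, below is a minimal sketch of a temperature-0, n-shot multiple-choice query of the kind the paper describes. It assumes the openai>=1.0 Python client; the model name, prompt format, and few-shot example structure are illustrative placeholders, not taken from the STEER codebase.

```python
# Minimal sketch: temperature-0 decoding with n-shot prompting (n in {0, 1, 2, 4, 5}).
# Assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment.
# The model name, prompt layout, and example fields below are illustrative only.
from openai import OpenAI

client = OpenAI()

def build_prompt(question, choices, examples):
    """Prepend n worked examples from the same domain/grade level to the target question."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nChoices: {', '.join(ex['choices'])}\nA: {ex['answer']}"
        for ex in examples
    )
    target = f"Q: {question}\nChoices: {', '.join(choices)}\nA:"
    return f"{shots}\n\n{target}" if shots else target

def ask(question, choices, examples=()):
    prompt = build_prompt(question, choices, examples)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,            # deterministic decoding, as reported in the paper
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()
```

The open-source models from the Hugging Face Hub could be queried analogously with the `transformers` library by generating with `do_sample=False`, which corresponds to the same temperature-0 (greedy) decoding.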