Can large language models explore in-context?

Authors: Akshay Krishnamurthy, Keegan Harris, Dylan J Foster, Cyril Zhang, Aleksandrs Slivkins

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and LLAMA2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions. (A minimal sketch of this in-context bandit protocol appears after the table.)
Researcher Affiliation | Collaboration | Akshay Krishnamurthy¹, Keegan Harris², Dylan J. Foster¹, Cyril Zhang¹, Aleksandrs Slivkins¹ (¹Microsoft Research, ²Carnegie Mellon University)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Algorithms are described in natural language.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper defines and generates multi-armed bandit instances with specific parameters (e.g., 'K = 5 arms and gap = 0.2', 'K = 4 and gap = 0.5') rather than using a pre-existing, publicly available dataset; therefore, no concrete access information for a public dataset is provided.
Dataset Splits | No | The paper describes experimental parameters such as the time horizon (T = 100, T = 200, T = 500) and the number of replicates (N = {10, 20, 40}) used to account for randomness. However, it does not specify traditional train/validation/test dataset splits, as the data is generated through interaction with the multi-armed bandit environment.
Hardware Specification | No | The paper mentions that 'LLAMA2 was essentially free from our perspective (since it was locally hosted)' and refers to accessing GPT-3.5 and GPT-4 via APIs. However, it does not provide specific details about the hardware used for local hosting (e.g., GPU/CPU models, memory specifications).
Software Dependencies | Yes | Specifically: GPT-3.5-TURBO-0613 (released 06/13/2023), GPT-4-0613 (released 06/13/2023), and LLAMA2-13B-CHAT quantized to 4 bits [24].
Experiment Setup | Yes | Our prompt design allows several independent choices. First is a scenario ... Second, we specify a framing ... Third, the history can be presented as ... Fourth, the requested final answer can be ... Finally, we either ... request the answer only, or ... also allow the LLM to provide a "chain-of-thought" (CoT) explanation. Altogether, these choices lead to 2^5 = 32 prompt designs, illustrated in Figure 2. We also consider two choices for the temperature parameter, 0 and 1. The main instance we consider has K = 5 arms and gap = 0.2. (See the configuration-enumeration sketch after the table.)
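
For concreteness, below is a minimal sketch of the fully in-context bandit protocol the abstract describes: a generated Bernoulli instance with K arms separated by a reward gap, the entire interaction history serialized into the prompt each round, and a model queried for the next arm over a horizon T across several replicates. The instance construction (means 0.5 ± gap/2), the prompt wording, and the query_llm stub (which just picks arms uniformly so the sketch runs end to end) are illustrative assumptions, not the authors' code; in the paper the query would go to GPT-3.5, GPT-4, or LLAMA2.

```python
import random

def make_bernoulli_bandit(num_arms=5, gap=0.2, seed=0):
    """Build a Bernoulli bandit with one best arm separated by `gap`.

    The paper reports instances such as K = 5 arms with gap = 0.2; the
    exact mean rewards (here 0.5 +/- gap/2) are an illustrative assumption.
    """
    rng = random.Random(seed)
    best_arm = rng.randrange(num_arms)
    means = [0.5 + gap / 2 if a == best_arm else 0.5 - gap / 2
             for a in range(num_arms)]
    return means, best_arm

def render_prompt(history, num_arms):
    """Serialize the environment description and the full interaction
    history into a single prompt, mirroring the fully in-context setup."""
    lines = [f"You are choosing among {num_arms} slot machines (arms 1..{num_arms}).",
             "Past pulls and rewards:"]
    lines += [f"  round {t}: arm {a + 1} -> reward {r}"
              for t, (a, r) in enumerate(history, 1)]
    lines.append("Which arm do you pull next? Answer with a single arm number.")
    return "\n".join(lines)

def query_llm(prompt):
    """Placeholder for a chat-completion call to GPT-3.5 / GPT-4 / LLAMA2.
    Here it simply explores uniformly at random so the sketch is runnable."""
    num_arms = int(prompt.split("among ")[1].split(" slot")[0])
    return str(random.randrange(num_arms) + 1)

def run_episode(horizon=100, num_arms=5, gap=0.2, seed=0):
    """One replicate of the in-context bandit protocol for `horizon` rounds."""
    rng = random.Random(seed)
    means, best_arm = make_bernoulli_bandit(num_arms, gap, seed)
    history = []
    for _ in range(horizon):
        arm = int(query_llm(render_prompt(history, num_arms))) - 1
        reward = int(rng.random() < means[arm])
        history.append((arm, reward))
    return sum(a == best_arm for a, _ in history) / horizon

if __name__ == "__main__":
    # N replicates over a horizon T, echoing the T and N values quoted above.
    results = [run_episode(horizon=100, seed=s) for s in range(10)]
    print("fraction of pulls on the best arm per replicate:", results)
```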
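
To make the 2^5 = 32 count explicit, the following snippet enumerates the five binary prompt-design choices crossed with the two temperature settings. The option labels are loose paraphrases of the choices named in the excerpt above; the paper's exact wording and option names differ.

```python
from itertools import product

# Five binary prompt-design choices from the experiment-setup excerpt.
# The labels below are paraphrases, not the paper's exact option names.
scenarios    = ["scenario A", "scenario B"]              # which scenario is used
framings     = ["neutral", "suggestive"]                 # how the task is framed
histories    = ["raw list", "summarized"]                # how history is presented
answers      = ["single arm", "distribution over arms"]  # requested final answer
explanations = ["answer only", "chain-of-thought"]       # CoT explanation or not

temperatures = [0, 1]

prompt_designs = list(product(scenarios, framings, histories, answers, explanations))
assert len(prompt_designs) == 2 ** 5  # 32 prompt designs, as stated in the paper

configurations = list(product(prompt_designs, temperatures))
print(f"{len(prompt_designs)} prompt designs x {len(temperatures)} temperatures "
      f"= {len(configurations)} experiment configurations")
```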