SmartPlay: A Benchmark for LLMs as Intelligent Agents

Authors: Yue Wu, Xuan Tang, Tom Mitchell, Yuanzhi Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games... We use SmartPlay to compare the agent performance of recent LLMs..." (Section 5, Experimental Results)
Researcher Affiliation | Collaboration | Yue Wu (1,2), Xuan Tang (1), Tom Mitchell (1), Yuanzhi Li (1,2); 1: Carnegie Mellon University, 2: Microsoft Research
Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | "We release our benchmark at github.com/microsoft/SmartPlay."
Open Datasets | Yes | The SmartPlay benchmark uses various game environments, citing their original sources for public access: "Crafter (Hafner, 2021)", "MESSENGER (Hanjie et al., 2021)", and "Minecraft (Fan et al., 2022)".
Dataset Splits | No | No explicit training/validation/test dataset splits are provided, as the paper evaluates pre-trained LLMs as agents in game environments over multiple trials rather than training models on specific data splits.
Hardware Specification | No | The paper evaluates the performance of various large language models (LLMs) on the SmartPlay benchmark but does not specify the hardware used to run these evaluations or the hardware used to train the LLMs themselves.
Software Dependencies | No | The paper mentions using a "unified OpenAI Gym interface" and refers to specific game implementations from GitHub, but no specific software versions (e.g., Python, PyTorch, Gym version) are provided.
Experiment Setup | Yes | "For ease of use and wide compatibility, SmartPlay follows a unified OpenAI Gym interface (Brockman et al., 2016) for all games, with text-based observations, text-based manuals with content as described in Table 1, text describing historical actions and observations covering past steps of length history_length, and flat categorical actions." "We follow Wu et al. (2023c) and directly prompt an LLM: 'What is the next action to take, let's think step by step.', with manual, history, and current observation as context. We then query the LLM: 'Choose the best executable action from the list of all actions. Write the exact chosen action.'"
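For reference, the setup quoted above translates into a short agent loop. The following is a minimal sketch, assuming a Gym-style SmartPlay environment with text observations and a generic query_llm helper; the environment id, the manual and action_list attributes, and query_llm are hypothetical placeholders rather than the benchmark's documented API.

```python
# Minimal sketch of the evaluation loop described in the Experiment Setup row.
# The env id, `manual`/`action_list` attributes, and `query_llm` are assumed
# placeholders, not SmartPlay's documented API.
import gym
from collections import deque


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the model's text reply."""
    raise NotImplementedError


def run_episode(env_id: str = "smartplay/MessengerL1-v0", history_length: int = 2) -> float:
    env = gym.make(env_id)                      # unified Gym interface (Brockman et al., 2016)
    manual = getattr(env, "manual", "")         # text manual, content as in Table 1 (assumed attribute)
    actions = getattr(env, "action_list", [])   # flat categorical action names (assumed attribute)
    history = deque(maxlen=history_length)      # past actions/observations over `history_length` steps

    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        context = (
            f"Manual:\n{manual}\n\n"
            "History:\n" + "\n".join(history) + "\n\n"
            f"Current observation:\n{obs}\n\n"
            f"All actions: {', '.join(actions)}\n\n"
        )
        # Step 1: chain-of-thought prompt, following Wu et al. (2023c).
        reasoning = query_llm(
            context + "What is the next action to take, let's think step by step."
        )
        # Step 2: ask for the exact action string, then map it to a categorical index.
        chosen = query_llm(
            context + reasoning + "\n\nChoose the best executable action from the "
            "list of all actions. Write the exact chosen action."
        )
        action = actions.index(chosen) if chosen in actions else 0  # fall back to a default action
        obs, reward, done, info = env.step(action)
        total_reward += reward
        history.append(f"Action: {chosen}\nObservation: {obs}")
    return total_reward
```

The two queries mirror the prompting scheme quoted from the paper: a free-form step-by-step reasoning prompt followed by a constrained request for the exact action string, which is then matched against the flat categorical action list.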