SmartPlay: A Benchmark for LLMs as Intelligent Agents

Authors: Yue Wu, Xuan Tang, Tom Mitchell, Yuanzhi Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games... We use SmartPlay to compare the agent performance of recent LLMs..." (Section 5, Experimental Results)
Researcher Affiliation | Collaboration | Yue Wu (1,2), Xuan Tang (1), Tom Mitchell (1), Yuanzhi Li (1,2); 1: Carnegie Mellon University, 2: Microsoft Research
Pseudocode | No | No pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | "We release our benchmark at github.com/microsoft/SmartPlay."
Open Datasets | Yes | The SmartPlay benchmark uses various game environments, citing their original sources for public access: "Crafter (Hafner, 2021)", "MESSENGER (Hanjie et al., 2021)", and "Minecraft (Fan et al., 2022)".
Dataset Splits | No | No explicit training/validation/test dataset splits are provided, as the paper evaluates pre-trained LLMs as agents in game environments over multiple trials rather than training models on specific data splits.
Hardware Specification | No | The paper evaluates the performance of various large language models (LLMs) on the SmartPlay benchmark but does not specify the hardware used to run these evaluations or the hardware used to train the LLMs themselves.
Software Dependencies | No | The paper mentions using a "unified OpenAI Gym interface" and refers to specific game implementations from GitHub, but no specific software versions (e.g., Python, PyTorch, Gym version) are provided.
Experiment Setup | Yes | "For ease of use and wide compatibility, SmartPlay follows a unified OpenAI Gym interface (Brockman et al., 2016) for all games, with text-based observations, text-based manuals with content as described in Table 1, text describing historical actions and observations covering past steps of length history_length, and flat categorical actions." "We follow Wu et al. (2023c) and directly prompt an LLM: 'What is the next action to take, let's think step by step.', with manual, history, and current observation as context. We then query the LLM: 'Choose the best executable action from the list of all actions. Write the exact chosen action.'"
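For reference, the setup quoted above translates into a short agent loop. The following is a minimal sketch, assuming a Gym-style SmartPlay environment with text observations and a generic query_llm helper; the environment id, the manual and action_list attributes, and query_llm are hypothetical placeholders rather than the benchmark's documented API.

```python
# Minimal sketch of the evaluation loop described in the Experiment Setup row.
# The env id, `manual`/`action_list` attributes, and `query_llm` are assumed
# placeholders, not SmartPlay's documented API.
import gym
from collections import deque


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the model's text reply."""
    raise NotImplementedError


def run_episode(env_id: str = "smartplay/MessengerL1-v0", history_length: int = 2) -> float:
    env = gym.make(env_id)                      # unified Gym interface (Brockman et al., 2016)
    manual = getattr(env, "manual", "")         # text manual, content as in Table 1 (assumed attribute)
    actions = getattr(env, "action_list", [])   # flat categorical action names (assumed attribute)
    history = deque(maxlen=history_length)      # past actions/observations over `history_length` steps

    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        context = (
            f"Manual:\n{manual}\n\n"
            "History:\n" + "\n".join(history) + "\n\n"
            f"Current observation:\n{obs}\n\n"
            f"All actions: {', '.join(actions)}\n\n"
        )
        # Step 1: chain-of-thought prompt, following Wu et al. (2023c).
        reasoning = query_llm(
            context + "What is the next action to take, let's think step by step."
        )
        # Step 2: ask for the exact action string, then map it to a categorical index.
        chosen = query_llm(
            context + reasoning + "\n\nChoose the best executable action from the "
            "list of all actions. Write the exact chosen action."
        )
        action = actions.index(chosen) if chosen in actions else 0  # fall back to a default action
        obs, reward, done, info = env.step(action)
        total_reward += reward
        history.append(f"Action: {chosen}\nObservation: {obs}")
    return total_reward
```

The two queries mirror the prompting scheme quoted from the paper: a free-form step-by-step reasoning prompt followed by a constrained request for the exact action string, which is then matched against the flat categorical action list.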