AgentBench: Evaluating LLMs as Agents

Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present AGENTBENCH, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 29 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B.
Researcher Affiliation | Academia | 1 Tsinghua University, 2 The Ohio State University, 3 UC Berkeley
Pseudocode | No | The paper describes experimental procedures and data processing steps in narrative text and figures but does not include formal pseudocode blocks or algorithms.
Open Source Code | Yes | Datasets, environments, and an integrated evaluation package for AGENTBENCH are released at https://github.com/THUDM/AgentBench.
Open Datasets | Yes | All datasets, whether newly created or adapted from existing ones, are meticulously designed and reformulated to simulate interactive environments where text-only LLMs can operate as autonomous agents. All datasets are publicly available.
Dataset Splits | Yes | We provide two splits for each dataset: Dev and Test. All datasets are publicly available.
Hardware Specification | No | We would like to thank... Zhipu AI for covering all GPU and API cost consumed in this study. (This mentions "GPU" but lacks specific model numbers or detailed specifications.)
Software Dependencies | No | The paper mentions software like "pybind" and "Virtuoso" but does not specify their version numbers, which are required for a reproducible description of software dependencies.
Experiment Setup | Yes | To ensure reproducible results, we set temperature=0 (i.e., greedy decoding) in the inference on all tasks following (Wei et al., 2022b).
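
As an illustration of the reported experiment setup, the sketch below shows how temperature=0 (greedy decoding) maps onto a typical chat-completion call. It is a minimal sketch assuming the OpenAI Python client; the model name and prompt contents are illustrative placeholders, not the paper's actual evaluation harness or model list.

```python
# Minimal sketch, assuming the OpenAI Python client.
# Model name and prompts are placeholders; only temperature=0 reflects the paper's stated setting.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; the benchmark evaluates many API-based and OSS LLMs
    messages=[
        {"role": "system", "content": "You are an agent acting in an interactive environment."},
        {"role": "user", "content": "Observation: ...\nWhat action do you take next?"},  # placeholder turn
    ],
    temperature=0,  # greedy decoding, so repeated runs give reproducible outputs
)
print(response.choices[0].message.content)
```

With temperature=0 the sampler always picks the highest-probability token, which is what makes repeated evaluation runs comparable across models and tasks.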