WebArena: A Realistic Web Environment for Building Autonomous Agents
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. |
| Researcher Affiliation | Academia | Shuyan Zhou Frank F. Xu Hao Zhu Xuhui Zhou Robert Lo Abishek Sridhar Xianyi Cheng Tianyue Ou Yonatan Bisk Daniel Fried Uri Alon Graham Neubig Carnegie Mellon University {shuyanzh, fangzhex, gneubig}@cs.cmu.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/. |
| Open Datasets | Yes | Along with WebArena, we release a ready-to-use benchmark with 812 long-horizon web-based tasks (§3). |
| Dataset Splits | No | The paper provides a benchmark of "812 test examples" but does not specify any training, validation, or test split within the benchmark; it refers only to a test set. |
| Hardware Specification | No | The paper acknowledges the "Center for AI Safety for providing computational resources" and "AWS AI" but does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models like GPT-3.5-TURBO-16K-0613, GPT-4-0613, and TEXT-BISON-001, and open-source platforms like Adobe Magento, Postmill, GitLab, OpenStreetMap, and Docker containers. However, it does not list the specific software libraries or tools, with version numbers, used to run the experiments beyond the model names themselves. |
| Experiment Setup | Yes | We experiment with GPT-3.5-TURBO-16K-0613, GPT-4-0613, and TEXT-BISON-001 with a temperature of 1.0 and a top-p parameter of 0.9. The maximum number of state transitions is set to 30. We halt execution if the same action is repeated more than three times on the same observation or if the agent generates three consecutive invalid actions. These situations typically indicate a high likelihood of execution failure and hence warrant early termination. For TEXT-BISON-001, we additionally allow ten retries until it generates a valid action. |
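The Experiment Setup row above describes the sampling parameters and early-termination rules used for the baseline agents. Below is a minimal sketch of that rollout logic under stated assumptions: the `agent`/`env` interface, function names, and return values are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the rollout loop implied by the Experiment Setup row.
# The agent/env objects and their methods are assumptions, not WebArena's API.

MAX_STEPS = 30      # maximum number of state transitions
MAX_REPEATS = 3     # same action repeated on the same observation
MAX_INVALID = 3     # consecutive invalid actions
MAX_RETRIES = 10    # extra retries allowed for TEXT-BISON-001 only


def run_episode(agent, env, allow_retries=False):
    obs = env.reset()
    repeat_count, invalid_count = 0, 0
    last_pair = None

    for _ in range(MAX_STEPS):
        # Sample an action with temperature 1.0 and top-p 0.9; for
        # TEXT-BISON-001, retry up to MAX_RETRIES times until it is valid.
        action = None
        for _ in range(MAX_RETRIES if allow_retries else 1):
            candidate = agent.act(obs, temperature=1.0, top_p=0.9)
            if env.is_valid(candidate):
                action = candidate
                break

        if action is None:
            invalid_count += 1
            if invalid_count >= MAX_INVALID:
                return "early_stop: consecutive invalid actions"
            continue
        invalid_count = 0

        # Halt if the same action keeps being issued on the same observation.
        if (obs, action) == last_pair:
            repeat_count += 1
            if repeat_count > MAX_REPEATS:
                return "early_stop: repeated action on same observation"
        else:
            repeat_count = 0
        last_pair = (obs, action)

        obs, done = env.step(action)
        if done:
            return "finished"

    return "max_steps_reached"
```

The early-stop checks mirror the paper's stated heuristics: such situations typically indicate a high likelihood of execution failure, so the episode is terminated rather than run to the 30-step limit.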