WebArena: A Realistic Web Environment for Building Autonomous Agents
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. |
| Researcher Affiliation | Academia | Shuyan Zhou Frank F. Xu Hao Zhu Xuhui Zhou Robert Lo Abishek Sridhar Xianyi Cheng Tianyue Ou Yonatan Bisk Daniel Fried Uri Alon Graham Neubig Carnegie Mellon University {shuyanzh, fangzhex, gneubig}@cs.cmu.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/. |
| Open Datasets | Yes | Along with WebArena, we release a ready-to-use benchmark with 812 long-horizon web-based tasks (§3). |
| Dataset Splits | No | The paper provides a benchmark of "812 test examples" but does not specify any training, validation, or test split within the benchmark; it refers only to a test set. |
| Hardware Specification | No | The paper acknowledges the "Center for AI Safety for providing computational resources" and "AWS AI" but does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models like GPT-3.5-TURBO-16K-0613, GPT-4-0613, and TEXT-BISON-001, and open-source platforms like Adobe Magento, Postmill, GitLab, OpenStreetMap, and Docker containers. However, it does not list the specific software libraries or tools, with version numbers, used to run the experiments beyond the model names themselves. |
| Experiment Setup | Yes | We experiment with GPT-3.5-TURBO-16K-0613, GPT-4-0613, and TEXT-BISON-001 with a temperature of 1.0 and a top-p parameter of 0.9. The maximum number of state transitions is set to 30. We halt execution if the same action is repeated more than three times on the same observation or if the agent generates three consecutive invalid actions. These situations typically indicate a high likelihood of execution failure and hence warrant early termination. For TEXT-BISON-001, we additionally allow ten retries until it generates a valid action. |
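The Experiment Setup row above describes the sampling parameters and early-termination rules used for the baseline agents. Below is a minimal sketch of that rollout logic under stated assumptions: the `agent`/`env` interface, function names, and return values are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the rollout loop implied by the Experiment Setup row.
# The agent/env objects and their methods are assumptions, not WebArena's API.

MAX_STEPS = 30      # maximum number of state transitions
MAX_REPEATS = 3     # same action repeated on the same observation
MAX_INVALID = 3     # consecutive invalid actions
MAX_RETRIES = 10    # extra retries allowed for TEXT-BISON-001 only


def run_episode(agent, env, allow_retries=False):
    obs = env.reset()
    repeat_count, invalid_count = 0, 0
    last_pair = None

    for _ in range(MAX_STEPS):
        # Sample an action with temperature 1.0 and top-p 0.9; for
        # TEXT-BISON-001, retry up to MAX_RETRIES times until it is valid.
        action = None
        for _ in range(MAX_RETRIES if allow_retries else 1):
            candidate = agent.act(obs, temperature=1.0, top_p=0.9)
            if env.is_valid(candidate):
                action = candidate
                break

        if action is None:
            invalid_count += 1
            if invalid_count >= MAX_INVALID:
                return "early_stop: consecutive invalid actions"
            continue
        invalid_count = 0

        # Halt if the same action keeps being issued on the same observation.
        if (obs, action) == last_pair:
            repeat_count += 1
            if repeat_count > MAX_REPEATS:
                return "early_stop: repeated action on same observation"
        else:
            repeat_count = 0
        last_pair = (obs, action)

        obs, done = env.step(action)
        if done:
            return "finished"

    return "max_steps_reached"
```

The early-stop checks mirror the paper's stated heuristics: such situations typically indicate a high likelihood of execution failure, so the episode is terminated rather than run to the 30-step limit.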