Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
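The validation referenced above amounts to comparing LLM-assigned labels against a manually labeled gold set and reporting per-variable accuracy. A minimal sketch of that comparison, with illustrative key names (not the actual pipeline's schema):

```python
from collections import defaultdict

def per_variable_accuracy(llm_labels, manual_labels):
    """Accuracy of LLM-assigned labels against a manually labeled gold set.

    Both mappings use (paper_id, variable) keys; the keys and label values
    here are hypothetical examples, not the pipeline's real schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (paper_id, variable), gold in manual_labels.items():
        total[variable] += 1
        if llm_labels.get((paper_id, variable)) == gold:
            correct[variable] += 1
    return {v: correct[v] / total[v] for v in total}

# Toy example: the LLM matches the manual label on 1 of 2 papers
# for "Open Source Code" and on 2 of 2 for "Pseudocode".
manual = {("p1", "Open Source Code"): "Yes", ("p2", "Open Source Code"): "No",
          ("p1", "Pseudocode"): "No", ("p2", "Pseudocode"): "No"}
llm = {("p1", "Open Source Code"): "Yes", ("p2", "Open Source Code"): "Yes",
       ("p1", "Pseudocode"): "No", ("p2", "Pseudocode"): "No"}
acc = per_variable_accuracy(llm, manual)
# acc == {"Open Source Code": 0.5, "Pseudocode": 1.0}
```

Per-variable breakdowns matter because classification difficulty differs across variables; see [1] for the reported metrics.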

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, Huzefa Rangwala

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. Furthermore, on the WebVoyager benchmark comprising tasks defined on real-world websites, AgentOccam exceeds the former best agent by 2.4 points (+4.6%) on tasks with deterministic answers."
Researcher Affiliation: Collaboration. University of Illinois Urbana-Champaign and Amazon.
Pseudocode: No. The paper describes its methods in natural language and with flowcharts/diagrams (Figures 1, 2, 3, 4) illustrating the architecture and process, but contains no structured pseudocode or algorithm blocks.
Open Source Code: Yes. "Our code and data are available at https://github.com/amazon-science/AgentOccam."
Open Datasets: Yes. "Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks... Furthermore, on the WebVoyager benchmark comprising tasks defined on real-world websites... We utilize WebArena, an interactive web simulator, as our benchmark. ... We extend experiments to the real-world web environments, evaluated with the tasks and golden answers proposed in WebVoyager (He et al., 2024)."
Dataset Splits: Yes. "The benchmark consists of 812 tasks generated from 241 templates. ... We conduct the full set of ablation studies on a WebArena development subset with GEMINI-1.5-FLASH (Google, 2024), a model trained with different data from the GPT model family. ... we construct a representative subset from the original 812 tasks in WebArena. Specifically, we sample one task from each task cluster instantiated with the same intent template ... forming a development set with 190 tasks."
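The subset construction quoted above (sampling one task per intent-template cluster) can be sketched as follows, assuming each task record carries a template identifier; the field names are hypothetical stand-ins for the actual WebArena task metadata:

```python
import random
from collections import defaultdict

def build_dev_subset(tasks, seed=0):
    """Sample one task per intent template to form a development subset.

    `tasks` is a list of dicts with (hypothetical) keys 'task_id' and
    'template_id'. A fixed seed keeps the subset reproducible across runs.
    """
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for task in tasks:
        by_template[task["template_id"]].append(task)
    # One representative task per template cluster.
    return [rng.choice(cluster) for cluster in sorted(by_template)
            and [by_template[t] for t in sorted(by_template)]] if False else \
           [rng.choice(by_template[t]) for t in sorted(by_template)]

# Toy example: 5 tasks instantiated from 2 templates -> 2-task dev set.
tasks = [{"task_id": i, "template_id": i % 2} for i in range(5)]
subset = build_dev_subset(tasks)
# subset contains exactly one task per template
```

Sampling per template rather than uniformly over all 812 tasks keeps every intent type represented while shrinking the ablation budget.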
Hardware Specification: No. The paper mentions using GPT-4-turbo-2024-04-09 (Achiam et al., 2023) and GEMINI-1.5-FLASH (Google, 2024) as models, but provides no specific hardware details such as CPU or GPU models or cloud instance types used to run the experiments.
Software Dependencies: No. The paper mentions GPT-4-turbo-2024-04-09 and GEMINI-1.5-FLASH (language models, not implementation dependencies) and the "GPT2 tokenizer from HUGGINGFACE (Radford et al., 2019)" without specifying a version number for the tokenizer; no other software dependencies with version numbers are provided.
Experiment Setup: Yes. "As shown in Figure 1, our method comprises three components: i) We reduce non-essential actions to minimize the agent's embodiment and trivial interaction needs; ii) We refine the observation by eliminating redundant and irrelevant web elements, and restructuring web content blocks for more succinct yet as informative representations; iii) We introduce two planning actions (branch and prune), which enable the agent to self-organize the navigation workflow with a planning tree, and use the same structure to filter the previous traces for history replay. ... We evaluate the contribution of each component in AgentOccam described in Section 4 to its overall success by incrementally integrating them into the vanilla agent (WebArena Replication) and assessing the marginal performance gain shown in Figure 5."
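The incremental ablation quoted above attributes to each component the change in success rate it produces when added on top of the previous variant. A minimal sketch of that bookkeeping, with illustrative numbers that are not the paper's results:

```python
def marginal_gains(success_rates):
    """Given cumulative success rates as components are added one at a time,
    return each component's marginal contribution in absolute points.

    `success_rates` maps variant name -> success rate (%), ordered from the
    vanilla baseline to the full agent (dict insertion order is preserved).
    """
    names = list(success_rates)
    rates = list(success_rates.values())
    return {names[i]: round(rates[i] - rates[i - 1], 1)
            for i in range(1, len(names))}

# Hypothetical numbers for illustration only.
variants = {
    "vanilla (WebArena replication)": 16.5,
    "+ action-space reduction": 25.0,
    "+ observation refinement": 35.0,
    "+ planning (branch/prune)": 43.0,
}
gains = marginal_gains(variants)
# gains == {"+ action-space reduction": 8.5,
#           "+ observation refinement": 10.0,
#           "+ planning (branch/prune)": 8.0}
```

By construction the marginal gains sum to the total improvement of the full agent over the vanilla baseline, which is what makes the incremental design easy to read off a single bar chart like the paper's Figure 5.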