Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, Huzefa Rangwala
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AGENTOCCAM surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. Furthermore, on WebVoyager benchmark comprising tasks defined on real-world websites, AGENTOCCAM exceeds the former best agent by 2.4 points (+4.6%) on tasks with deterministic answers. |
| Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign, Amazon |
| Pseudocode | No | The paper describes methods using natural language and flowcharts/diagrams (Figures 1, 2, 3, 4) to illustrate the architecture and process, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/amazon-science/AgentOccam. |
| Open Datasets | Yes | Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks... Furthermore, on WebVoyager benchmark comprising tasks defined on real-world websites... We utilize WebArena, an interactive web simulator, as our benchmark. ... We extend experiments to the real-world web environments, evaluated with the tasks and golden answers proposed in WebVoyager (He et al., 2024). |
| Dataset Splits | Yes | The benchmark consists of 812 tasks generated from 241 templates. ... We conduct the full set of ablation studies on a WebArena development subset with GEMINI-1.5-FLASH (Google, 2024), a model trained with different data from the GPT model family. ... we construct a representative subset from the original 812 tasks in WebArena. Specifically, we sample one task from each task cluster instantiated with the same intent template ... forming a development set with 190 tasks. |
| Hardware Specification | No | The paper mentions using 'GPT-4-turbo-2024-04-09 (Achiam et al., 2023)' and 'GEMINI-1.5-FLASH (Google, 2024)' as models, but does not provide any specific hardware details like CPU, GPU models, or cloud computing instance types used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-4-turbo-2024-04-09' and 'GEMINI-1.5-FLASH' (which are language models, not general software dependencies for implementation) and 'GPT2 tokenizer from HUGGINGFACE (Radford et al., 2019)' without specifying a version number for the tokenizer itself. No other specific software dependencies with version numbers are provided. |
| Experiment Setup | Yes | As shown in Figure 1, our method comprises of three components: i) We reduce non-essential actions to minimize the agent's embodiment and trivial interaction needs; ii) We refine the observation by eliminating redundant and irrelevant web elements, and restructuring web content blocks for more succinct yet as informative representations; iii) We introduce two planning actions (branch and prune), which enables the agent to self-organize navigation workflow with a planning tree, and use the same structure to filter the previous traces for history replay. ... We evaluate the contribution of each component in AGENTOCCAM described in Section 4 to its overall success by incrementally integrating them into the vanilla agent (WebArena Replication) and assessing the marginal performance gain shown in Figure 5. |
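
The Dataset Splits row describes building the development subset by sampling one task from each task cluster. The sketch below illustrates that idea only; it is not the authors' code. The function name `sample_one_per_cluster`, the `cluster_id` field, the seed, and the toy task records are all hypothetical, and the excerpt elides how clusters are actually defined (the paper reports 241 templates but a 190-task subset, so clusters need not map one-to-one onto templates).

```python
import random
from collections import defaultdict


def sample_one_per_cluster(tasks, cluster_key="cluster_id", seed=0):
    """Pick one task at random from each cluster.

    `tasks` is a list of dicts; `cluster_key` names the (hypothetical)
    field that identifies which cluster a task belongs to.
    Results are ordered by cluster key and reproducible for a fixed seed.
    """
    rng = random.Random(seed)  # local RNG so the global state is untouched
    clusters = defaultdict(list)
    for task in tasks:
        clusters[task[cluster_key]].append(task)
    return [rng.choice(clusters[key]) for key in sorted(clusters)]


# Toy data: five tasks spread over three hypothetical clusters.
tasks = [
    {"task_id": 0, "cluster_id": "a"},
    {"task_id": 1, "cluster_id": "a"},
    {"task_id": 2, "cluster_id": "b"},
    {"task_id": 3, "cluster_id": "c"},
    {"task_id": 4, "cluster_id": "c"},
]
dev_set = sample_one_per_cluster(tasks)
print(len(dev_set), [t["cluster_id"] for t in dev_set])
```

Fixing the seed keeps the subset stable across runs, which matters when ablations are compared on the same development set.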