A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Authors: Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation. (Section 4: Experimental Results) |
| Researcher Affiliation | Collaboration | Google DeepMind, The University of Tokyo |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the code for the described methodology is open-source or provide a link to a code repository. |
| Open Datasets | Yes | On MiniWoB++ (Liu et al., 2018; Shi et al., 2017), HTML-T5 achieves 18.7% higher success than previous language model agent (Gur et al., 2022) ... On the Mind2Web (Deng et al., 2023), an offline task planning dataset, HTML-T5 achieves SoTA performance... For the pre-training dataset, we collect 100 WARC files (April 2019) from the Common Crawl corpus... (a WARC-reading sketch follows the table) |
| Dataset Splits | No | The paper mentions training, but does not provide specific percentages or counts for training/validation/test splits, nor does it cite a predefined split in a way that allows reproduction of the split. |
| Hardware Specification | Yes | We have used cloud TPU-v3, which has a 32 GiB HBM memory space, with 128 cores for the experiments. |
| Software Dependencies | Yes | We use the implementation of local and global attentions released by Guo et al. (2022) ... We leverage SeqIO (Roberts et al., 2022) and T5X (Roberts et al., 2022) library to manage the training pipeline. We also use SentencePiece (Kudo & Richardson, 2018) with 32K tokens from C4 dataset (Raffel et al., 2020) as a tokenizer. ... We use Selenium WebDriver, a popular library for browser automation... (a usage sketch follows the table) |
| Experiment Setup | Yes | We adopt 4096 input sequence length and 910 output sequence length during pre-training. In total, 15% of input tokens are randomly masked in the denoising objective. ... We pre-train HTML-T5 for 100K iterations following the practice in other T5 models (Chung et al., 2022; Lester et al., 2021). ... We set the local radius to r = 127, and block size for transient global attention to k = 16. ... We adopt 16K tokens for the context window unless otherwise mentioned. |
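
Pre-training data (Open Datasets row): a minimal sketch of pulling raw HTML out of a single Common Crawl WARC file, assuming the `warcio` library and a locally downloaded archive. The file name is hypothetical; the checklist does not say which 100 WARC files from the April 2019 crawl were used or how they were filtered.

```python
from warcio.archiveiterator import ArchiveIterator

# Hypothetical local path to one Common Crawl WARC file (April 2019 crawl);
# the paper's exact file list and filtering are not given in the quotes above.
WARC_PATH = "commoncrawl-2019-04-segment.warc.gz"

def iter_html_documents(warc_path):
    """Yield raw HTML payloads from the HTTP response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue  # keep only HTML pages for HTML-style pre-training
            yield record.content_stream().read().decode("utf-8", errors="ignore")

if __name__ == "__main__":
    for i, html in enumerate(iter_html_documents(WARC_PATH)):
        print(i, len(html))
        if i >= 4:  # peek at a few documents only
            break
```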
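
Software Dependencies row: a small usage sketch of the SentencePiece tokenizer and Selenium WebDriver mentioned above. The tokenizer model path is hypothetical (the 32K-vocabulary C4 model itself is not provided), and the choice of Chrome as the browser is an assumption.

```python
import sentencepiece as spm
from selenium import webdriver

# Hypothetical tokenizer model file; stands in for the paper's 32K-token
# SentencePiece vocabulary trained on the C4 dataset.
sp = spm.SentencePieceProcessor(model_file="c4_32k.model")
token_ids = sp.encode("<div id=nav><a href=/cart>Cart</a></div>", out_type=int)

# Browser automation with Selenium WebDriver; Chrome is an assumed choice.
driver = webdriver.Chrome()
driver.get("https://example.com")
raw_html = driver.page_source  # raw HTML snapshot a web agent would consume
driver.quit()

print(len(token_ids), len(raw_html))
```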
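
Experiment Setup row: the quoted pre-training hyperparameters gathered into one illustrative Python dictionary. The key names are hypothetical conveniences, not the authors' T5X/SeqIO configuration schema; every value is transcribed from the quotes in the table.

```python
# Illustrative summary of HTML-T5 pre-training settings reported in the paper;
# key names are invented for readability, values come from the quoted text.
HTML_T5_PRETRAIN_CONFIG = {
    "input_sequence_length": 4096,      # encoder input length during pre-training
    "output_sequence_length": 910,      # decoder target length during pre-training
    "mask_rate": 0.15,                  # 15% of input tokens masked for denoising
    "pretrain_steps": 100_000,          # 100K iterations, following other T5 models
    "local_attention_radius": 127,      # r = 127 for local attention
    "transient_global_block_size": 16,  # k = 16 for transient global attention
    "vocab_size": 32_000,               # SentencePiece tokenizer trained on C4
    "context_window_tokens": 16_384,    # 16K-token context window unless otherwise noted
}
```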