A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Authors: Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation. (Section 4: Experimental Results) |
| Researcher Affiliation | Collaboration | Google DeepMind, The University of Tokyo |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the code for the described methodology is open-source or provide a link to a code repository. |
| Open Datasets | Yes | On MiniWoB++ (Liu et al., 2018; Shi et al., 2017), HTML-T5 achieves 18.7% higher success than previous language model agent (Gur et al., 2022) ... On the Mind2Web (Deng et al., 2023), an offline task planning dataset, HTML-T5 achieves SoTA performance... For the pre-training dataset, we collect 100 WARC files (April 2019) from the Common Crawl corpus... (a WARC-reading sketch follows the table) |
| Dataset Splits | No | The paper mentions training, but does not provide specific percentages or counts for training/validation/test splits, nor does it cite a predefined split in a way that allows reproduction of the split. |
| Hardware Specification | Yes | We have used cloud TPU-v3, which has a 32 GiB HBM memory space, with 128 cores for the experiments. |
| Software Dependencies | Yes | We use the implementation of local and global attentions released by Guo et al. (2022) ... We leverage SeqIO (Roberts et al., 2022) and T5X (Roberts et al., 2022) library to manage the training pipeline. We also use SentencePiece (Kudo & Richardson, 2018) with 32K tokens from C4 dataset (Raffel et al., 2020) as a tokenizer. ... We use Selenium WebDriver, a popular library for browser automation... (a usage sketch follows the table) |
| Experiment Setup | Yes | We adopt 4096 input sequence length and 910 output sequence length during pre-training. In total, 15% of input tokens are randomly masked in the denoising objective. ... We pre-train HTML-T5 for 100K iterations following the practice in other T5 models (Chung et al., 2022; Lester et al., 2021). ... We set the local radius to r = 127, and block size for transient global attention to k = 16. ... We adopt 16K tokens for the context window unless otherwise mentioned. |
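
Pre-training data (Open Datasets row): a minimal sketch of pulling raw HTML out of a single Common Crawl WARC file, assuming the `warcio` library and a locally downloaded archive. The file name is hypothetical; the checklist does not say which 100 WARC files from the April 2019 crawl were used or how they were filtered.

```python
from warcio.archiveiterator import ArchiveIterator

# Hypothetical local path to one Common Crawl WARC file (April 2019 crawl);
# the paper's exact file list and filtering are not given in the quotes above.
WARC_PATH = "commoncrawl-2019-04-segment.warc.gz"

def iter_html_documents(warc_path):
    """Yield raw HTML payloads from the HTTP response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue  # keep only HTML pages for HTML-style pre-training
            yield record.content_stream().read().decode("utf-8", errors="ignore")

if __name__ == "__main__":
    for i, html in enumerate(iter_html_documents(WARC_PATH)):
        print(i, len(html))
        if i >= 4:  # peek at a few documents only
            break
```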
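
Software Dependencies row: a small usage sketch of the SentencePiece tokenizer and Selenium WebDriver mentioned above. The tokenizer model path is hypothetical (the 32K-vocabulary C4 model itself is not provided), and the choice of Chrome as the browser is an assumption.

```python
import sentencepiece as spm
from selenium import webdriver

# Hypothetical tokenizer model file; stands in for the paper's 32K-token
# SentencePiece vocabulary trained on the C4 dataset.
sp = spm.SentencePieceProcessor(model_file="c4_32k.model")
token_ids = sp.encode("<div id=nav><a href=/cart>Cart</a></div>", out_type=int)

# Browser automation with Selenium WebDriver; Chrome is an assumed choice.
driver = webdriver.Chrome()
driver.get("https://example.com")
raw_html = driver.page_source  # raw HTML snapshot a web agent would consume
driver.quit()

print(len(token_ids), len(raw_html))
```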
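
Experiment Setup row: the quoted pre-training hyperparameters gathered into one illustrative Python dictionary. The key names are hypothetical conveniences, not the authors' T5X/SeqIO configuration schema; every value is transcribed from the quotes in the table.

```python
# Illustrative summary of HTML-T5 pre-training settings reported in the paper;
# key names are invented for readability, values come from the quoted text.
HTML_T5_PRETRAIN_CONFIG = {
    "input_sequence_length": 4096,      # encoder input length during pre-training
    "output_sequence_length": 910,      # decoder target length during pre-training
    "mask_rate": 0.15,                  # 15% of input tokens masked for denoising
    "pretrain_steps": 100_000,          # 100K iterations, following other T5 models
    "local_attention_radius": 127,      # r = 127 for local attention
    "transient_global_block_size": 16,  # k = 16 for transient global attention
    "vocab_size": 32_000,               # SentencePiece tokenizer trained on C4
    "context_window_tokens": 16_384,    # 16K-token context window unless otherwise noted
}
```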