Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Authors: Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Kuang-Huei Lee (2), Ofir Nachum (2), Yutaka Matsuo (1), Aleksandra Faust (2), Shixiang Shane Gu (1,2), Izzeddin Gur (2); 1 The University of Tokyo, 2 Google DeepMind
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. (Footnote 1: https://console.cloud.google.com/storage/browser/gresearch/webllm)
Open Datasets | Yes | On the MiniWoB, we improve over the previous best offline methods by more than 45.8%... On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. ... We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. (Footnote 1: https://console.cloud.google.com/storage/browser/gresearch/webllm; see the data-access sketch at the end of this section.)
Dataset Splits | No | The paper does not explicitly mention a validation dataset split with specific proportions or sample counts. It refers to training data sizes (2.8K, 68K, 347K demonstrations) and uses 100 evaluation episodes per task for MiniWoB++ testing and 500 user instructions for WebShop testing.
Hardware Specification | Yes | We use cloud TPU-v4, which has a 32 GiB HBM memory space for the experiments. Base-size models require 256 cores and XL-size models require 512 cores, which takes 1-2 days for finetuning.
Software Dependencies | No | The paper mentions software components like the "SeqIO (Roberts et al., 2022) library" and the "SentencePiece (Kudo & Richardson, 2018) vocabulary," but it does not specify version numbers for these or other ancillary software components necessary for reproduction.
Experiment Setup | Yes | The batch size for training is 128, and the input sequence length is set to 4096 tokens.
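
For convenience, the reported training settings can be gathered into a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and only the values (batch size 128, 4096-token inputs, cloud TPU-v4 with 256 or 512 cores, 1-2 days of finetuning) come from the quotes above.

```python
# Illustrative summary of the reported finetuning setup.
# Field names are hypothetical; values are taken from the quoted text only.
from dataclasses import dataclass

@dataclass
class WebGUMFinetuneConfig:
    batch_size: int = 128              # "The batch size for training is 128"
    input_seq_len: int = 4096          # "input sequence length is set to 4096 tokens"
    tpu_generation: str = "v4"         # cloud TPU-v4 with 32 GiB HBM (as quoted)
    tpu_cores_base_model: int = 256    # Base-size models
    tpu_cores_xl_model: int = 512      # XL-size models
    expected_walltime_days: str = "1-2"

print(WebGUMFinetuneConfig())
```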
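
The released demonstrations are hosted in the public Google Cloud Storage location linked in footnote 1. Below is a minimal, hedged sketch for listing the files anonymously with the google-cloud-storage Python client; the `gresearch` bucket name and `webllm/` prefix are inferred from the console URL, and the exact object layout inside the bucket is not specified in this section.

```python
# Minimal sketch: anonymously list the released demonstration files.
# Bucket/prefix inferred from
# https://console.cloud.google.com/storage/browser/gresearch/webllm
# (i.e., gs://gresearch/webllm); the object layout is not documented here.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client.create_anonymous_client()  # public bucket, no credentials needed
for blob in client.list_blobs("gresearch", prefix="webllm/"):
    print(blob.name, blob.size)
```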