Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Authors: Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. |
| Researcher Affiliation | Collaboration | Hiroki Furuta1,2 Kuang-Huei Lee2 Ofir Nachum2 Yutaka Matsuo1 Aleksandra Faust2 Shixiang Shane Gu1,2 Izzeddin Gur2 1The University of Tokyo 2Google DeepMind |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. https://console.cloud.google.com/storage/browser/gresearch/webllm (a download sketch follows this table) |
| Open Datasets | Yes | On the MiniWoB, we improve over the previous best offline methods by more than 45.8%... On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. ... We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. https://console.cloud.google.com/storage/browser/gresearch/webllm |
| Dataset Splits | No | The paper does not explicitly mention a validation dataset split with specific proportions or sample counts. It refers to training data sizes (2.8K, 68K, 347K demonstrations) and uses 100 evaluation episodes per task for MiniWoB++ testing and 500 user instructions for WebShop testing. |
| Hardware Specification | Yes | We use cloud TPU-v4, which has a 32 GiB HBM memory space, for the experiments. Base-size models require 256 cores and XL-size models require 512 cores; finetuning takes 1-2 days. |
| Software Dependencies | No | The paper mentions software components like the "SeqIO (Roberts et al., 2022) library" and "SentencePiece (Kudo & Richardson, 2018) vocabulary," but it does not specify version numbers for these or other ancillary software components necessary for reproduction. |
| Experiment Setup | Yes | The batch size for training is 128, and the input sequence length is set to 4096 tokens. A finetuning sketch using these values follows this table. |
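
The released demonstrations live in a public Google Cloud Storage bucket. Below is a minimal sketch for listing and fetching them anonymously; the `gs://gresearch/webllm` path is read off the console URL above, and the object names and file format are assumptions, not something the paper documents.

```python
# Minimal sketch: browsing the released WebGUM demonstration data in the
# public GCS bucket. The gs://gresearch/webllm prefix is inferred from the
# console URL in the paper; the exact object layout is an assumption.
from google.cloud import storage

# The bucket is public, so an anonymous client needs no credentials.
client = storage.Client.create_anonymous_client()
bucket = client.bucket("gresearch")

# List the first few objects under the webllm/ prefix.
for blob in list(bucket.list_blobs(prefix="webllm"))[:10]:
    print(blob.name, blob.size)

# Download a single object to a local file (hypothetical object name):
# bucket.blob("webllm/<object-name>").download_to_filename("demos.tfrecord")
```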
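For the reported training hyperparameters (batch size 128, 4096-token inputs), the sketch below shows one way to reproduce the setup. It uses Hugging Face Transformers with `google/flan-t5-base` as a stand-in for the paper's T5X/SeqIO pipeline; the checkpoint name, prompt layout, and action format are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the reported finetuning setup (4096-token inputs), using
# Transformers in place of the paper's T5X/SeqIO stack.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# One (instruction, HTML) -> action demonstration; contents are illustrative.
inputs = tokenizer(
    "Instruction: click the submit button\n"
    "HTML: <button id=subm>Submit</button>",
    max_length=4096,        # input sequence length reported in the paper
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
labels = tokenizer("click id=subm", return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss on the action tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
# In practice this step runs with batch size 128, sharded over TPU-v4 cores.
```

Note that WebGUM is multimodal: image tokens from a vision encoder are fed alongside the HTML tokens. That visual path is omitted here for brevity.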