Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Authors: Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
Researcher Affiliation | Collaboration | Hiroki Furuta (1,2), Kuang-Huei Lee (2), Ofir Nachum (2), Yutaka Matsuo (1), Aleksandra Faust (2), Shixiang Shane Gu (1,2), Izzeddin Gur (2); 1 The University of Tokyo, 2 Google DeepMind
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. (Footnote 1: https://console.cloud.google.com/storage/browser/gresearch/webllm)
Open Datasets | Yes | On the MiniWoB, we improve over the previous best offline methods by more than 45.8%... On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. ... We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. (Footnote 1: https://console.cloud.google.com/storage/browser/gresearch/webllm; see the data-access sketch at the end of this section.)
Dataset Splits | No | The paper does not explicitly mention a validation dataset split with specific proportions or sample counts. It refers to training data sizes (2.8K, 68K, 347K demonstrations) and uses 100 evaluation episodes per task for MiniWoB++ testing and 500 user instructions for WebShop testing.
Hardware Specification | Yes | We use cloud TPU-v4, which has a 32 GiB HBM memory space for the experiments. Base-size models require 256 cores and XL-size models require 512 cores, which takes 1-2 days for finetuning.
Software Dependencies | No | The paper mentions software components like the "SeqIO (Roberts et al., 2022) library" and the "SentencePiece (Kudo & Richardson, 2018) vocabulary," but it does not specify version numbers for these or other ancillary software components necessary for reproduction.
Experiment Setup | Yes | The batch size for training is 128, and the input sequence length is set to 4096 tokens.
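
For convenience, the reported training settings can be gathered into a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and only the values (batch size 128, 4096-token inputs, cloud TPU-v4 with 256 or 512 cores, 1-2 days of finetuning) come from the quotes above.

```python
# Illustrative summary of the reported finetuning setup.
# Field names are hypothetical; values are taken from the quoted text only.
from dataclasses import dataclass

@dataclass
class WebGUMFinetuneConfig:
    batch_size: int = 128              # "The batch size for training is 128"
    input_seq_len: int = 4096          # "input sequence length is set to 4096 tokens"
    tpu_generation: str = "v4"         # cloud TPU-v4 with 32 GiB HBM (as quoted)
    tpu_cores_base_model: int = 256    # Base-size models
    tpu_cores_xl_model: int = 512      # XL-size models
    expected_walltime_days: str = "1-2"

print(WebGUMFinetuneConfig())
```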
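
The released demonstrations are hosted in the public Google Cloud Storage location linked in footnote 1. Below is a minimal, hedged sketch for listing the files anonymously with the google-cloud-storage Python client; the `gresearch` bucket name and `webllm/` prefix are inferred from the console URL, and the exact object layout inside the bucket is not specified in this section.

```python
# Minimal sketch: anonymously list the released demonstration files.
# Bucket/prefix inferred from
# https://console.cloud.google.com/storage/browser/gresearch/webllm
# (i.e., gs://gresearch/webllm); the object layout is not documented here.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client.create_anonymous_client()  # public bucket, no credentials needed
for blob in client.list_blobs("gresearch", prefix="webllm/"):
    print(blob.name, blob.size)
```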