GPT-4V(ision) is a Generalist Web Agent, if Grounded

Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites."
Researcher Affiliation | Academia | "Boyuan Zheng 1, Boyu Gou 1, Jihyung Kil 1, Huan Sun 1, Yu Su 1. 1 The Ohio State University, Columbus, OH."
Pseudocode | No | Not found. The paper includes structured prompts in Appendix D (e.g., Table 6), but these are not labeled as "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct."
Open Datasets | Yes | "We evaluate our methods on MIND2WEB (Deng et al., 2023)... This cleaned version of the dataset is called Multimodal Mind2Web, with the statistics in Table 1. The dataset is released at https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web."
Dataset Splits | No | Not found. Table 1 provides "Train", "Cross-Domain", "Cross-Task", and "Cross-Website" splits, but no explicit validation split is mentioned.
Hardware Specification | No | Not found. The paper discusses the models and software tools used, but does not give hardware details (e.g., GPU/CPU models, memory) for running the experiments.
Software Dependencies | No | Not found. The paper mentions software components such as Playwright, DeBERTa-base, BLIP-2, FLAN-T5, and the Supervision library, but generally lacks version numbers for these components, apart from GPT model names (e.g., "GPT-3.5-turbo-0613").
Experiment Setup | Yes | "We adopt the evaluation metrics utilized in MIND2WEB. Element Accuracy (Ele. Acc) compares the predicted element with the ground-truth elements. Operation F1 (Op. F1) calculates the token-level F1 score for the predicted operation comprised of action and input value. Step Success Rate (Step SR) measures the success of each action step. ... In grounding via image annotation and textual choices, we first employ the DeBERTa-base cross-encoder from MindAct (Deng et al., 2023) to rank the top 50 elements for better comparison with its text-only counterparts. Then, we cluster elements into groups of 17 options for inference."
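The three MIND2WEB metrics quoted above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the whitespace tokenization for Operation F1, and the rule that a step succeeds only when the element is correct and the operation matches exactly are assumptions for clarity.

```python
from collections import Counter

def element_accuracy(pred_elements, gold_element_sets):
    """Ele. Acc: fraction of steps where the predicted element
    is one of the ground-truth elements (assumed exact-match rule)."""
    hits = sum(p in gold for p, gold in zip(pred_elements, gold_element_sets))
    return hits / len(gold_element_sets)

def operation_f1(pred_op, gold_op):
    """Op. F1: token-level F1 between the predicted and ground-truth
    operation strings (action plus input value); whitespace tokenization assumed."""
    pred_tokens, gold_tokens = pred_op.split(), gold_op.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def step_success(pred_element, gold_elements, pred_op, gold_op):
    """Step SR counts a step as successful only if both the element
    and the full operation are correct (assumed conjunction)."""
    return pred_element in gold_elements and operation_f1(pred_op, gold_op) == 1.0
```

For example, a predicted step `("button#submit", "CLICK")` against ground truth `(["button#submit"], "CLICK")` would count as a success, while a correct element with operation `"TYPE hello"` against `"TYPE hello world"` would score a partial Operation F1 but fail Step SR.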