GPT-4V(ision) is a Generalist Web Agent, if Grounded
Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. |
| Researcher Affiliation | Academia | Boyuan Zheng 1 Boyu Gou 1 Jihyung Kil 1 Huan Sun 1 Yu Su 1 1The Ohio State University, Columbus, OH. |
| Pseudocode | No | Not found. The paper includes structured prompts in Appendix D (e.g., Table 6), but these are not labeled as 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct. |
| Open Datasets | Yes | We evaluate our methods on MIND2WEB (Deng et al., 2023)... This cleaned version of the dataset is called Multimodal Mind2Web, with the statistics in Table 1. The dataset is released at https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web. |
| Dataset Splits | No | Not found. Table 1 provides 'Train', 'Cross-Domain', 'Cross-Task', and 'Cross-Website' splits, but no explicit 'validation' split is mentioned. |
| Hardware Specification | No | Not found. The paper discusses various models and software tools used, but does not provide specific hardware details (e.g., GPU/CPU models, memory) for running the experiments. |
| Software Dependencies | No | Not found. The paper mentions various software components like Playwright, DeBERTa-base, BLIP-2, FLAN-T5, and the Supervision library, but generally lacks specific version numbers for these components, except for GPT model names (e.g., 'GPT-3.5-turbo-0613'). |
| Experiment Setup | Yes | We adopt the evaluation metrics utilized in MIND2WEB. Element Accuracy (Ele. Acc) compares the predicted element with the ground-truth elements. Operation F1 (Op. F1) calculates the token-level F1 score for the predicted operation comprised of action and input value. Step Success Rate (Step SR) measures the success of each action step. ... In grounding via image annotation and textual choices, we first employ the DeBERTa-base cross-encoder from MindAct (Deng et al., 2023) to rank the top 50 elements for better comparison with its text-only counterparts. Then, we cluster elements into groups of 17 options for inference. |
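The quoted setup names three metrics without giving their computation. Below is a minimal sketch of how Operation F1 and Step Success Rate could be computed; the whitespace tokenization and the exact-match success criterion are assumptions (the paper defers to MIND2WEB's definitions), and the function names are hypothetical.

```python
from collections import Counter


def operation_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and ground-truth operations
    (action + input value). Tokenization scheme is an assumption here."""
    pred, gold = Counter(pred_tokens), Counter(gold_tokens)
    overlap = sum((pred & gold).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def step_success(pred_element, gold_elements, op_f1_score):
    """A step counts as successful when the selected element is among the
    ground-truth elements and the operation matches; requiring an exact
    operation match (F1 == 1.0) is a simplifying assumption."""
    return pred_element in gold_elements and op_f1_score == 1.0
```

Element Accuracy then reduces to the element-match half of `step_success`, averaged over steps; Step SR is the fraction of steps where both conditions hold.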