GPT-4V(ision) is a Generalist Web Agent, if Grounded
Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. |
| Researcher Affiliation | Academia | Boyuan Zheng 1 Boyu Gou 1 Jihyung Kil 1 Huan Sun 1 Yu Su 1 1The Ohio State University, Columbus, OH. |
| Pseudocode | No | Not found. The paper includes structured prompts in Appendix D (e.g., Table 6), but these are not labeled as 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct. |
| Open Datasets | Yes | We evaluate our methods on MIND2WEB (Deng et al., 2023)... This cleaned version of the dataset is called Multimodal Mind2Web, with the statistics in Table 1. The dataset is released at https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web. |
| Dataset Splits | No | Not found. Table 1 provides 'Train', 'Cross-Domain', 'Cross-Task', and 'Cross-Website' splits, but no explicit 'validation' split is mentioned. |
| Hardware Specification | No | Not found. The paper discusses various models and software tools used, but does not provide specific hardware details (e.g., GPU/CPU models, memory) for running the experiments. |
| Software Dependencies | No | Not found. The paper mentions various software components like Playwright, DeBERTa-base, BLIP-2, FLAN-T5, and the Supervision library, but generally lacks specific version numbers for these components, except for GPT model names (e.g., 'GPT-3.5-turbo-0613'). |
| Experiment Setup | Yes | We adopt the evaluation metrics utilized in MIND2WEB. Element Accuracy (Ele. Acc) compares the predicted element with the ground-truth elements. Operation F1 (Op. F1) calculates the token-level F1 score for the predicted operation comprised of action and input value. Step Success Rate (Step SR) measures the success of each action step. ... In grounding via image annotation and textual choices, we first employ the DeBERTa-base cross-encoder from MindAct (Deng et al., 2023) to rank the top 50 elements for better comparison with its text-only counterparts. Then, we cluster elements into groups of 17 options for inference. |
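The quoted setup names three metrics without giving their computation. Below is a minimal sketch of how Operation F1 and Step Success Rate could be computed; the whitespace tokenization and the exact-match success criterion are assumptions (the paper defers to MIND2WEB's definitions), and the function names are hypothetical.

```python
from collections import Counter


def operation_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and ground-truth operations
    (action + input value). Tokenization scheme is an assumption here."""
    pred, gold = Counter(pred_tokens), Counter(gold_tokens)
    overlap = sum((pred & gold).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def step_success(pred_element, gold_elements, op_f1_score):
    """A step counts as successful when the selected element is among the
    ground-truth elements and the operation matches; requiring an exact
    operation match (F1 == 1.0) is a simplifying assumption."""
    return pred_element in gold_elements and op_f1_score == 1.0
```

Element Accuracy then reduces to the element-match half of `step_success`, averaged over steps; Step SR is the fraction of steps where both conditions hold.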