Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Harnessing Webpage UIs for Text-Rich Visual Understanding

Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that training on MultiUI significantly improves model performance in both UI-related and general multimodal tasks. Notably, models trained on MultiUI achieved up to a 48% improvement on VisualWebBench (Liu et al., 2024c) and a 19.1% increase in element accuracy on Mind2Web (Deng et al., 2023). More surprisingly, we observed that this training generalizes to non-UI domains, resulting in improved performance in document understanding (Mathew et al., 2021), OCR (Singh et al., 2019; Liu et al., 2023c), and chart interpretation (Masry et al., 2022) tasks, outperforming even models specialized in these areas.
Researcher Affiliation Academia Carnegie Mellon University, The Chinese University of Hong Kong, Peking University, University of Waterloo
Pseudocode No The paper describes the construction pipeline through four stages: (1) raw website data scraping, (2) website curation, (3) task extraction from scraped websites, and (4) instruction construction, but these are described in natural language without structured pseudocode or algorithm blocks.
Open Source Code No The paper mentions "MultiUI, an open-source dataset" but does not explicitly provide a statement or link for the open-source code of its methodology (the UIX model or the data-generation pipeline scripts).
Open Datasets Yes To facilitate this, we introduce MultiUI, an open-source dataset containing 7.3 million samples spanning 1 million websites and various visual understanding tasks.
Dataset Splits Yes In this stage, we fine-tune the model on 95% of the MultiUI dataset to enhance its web/UI-related understanding capabilities. ... and the remaining 5% of the MultiUI data.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies Yes We developed UIX using Qwen2-7B-Instruct (Yang et al., 2024) as the primary LLM backbone. We also use Vicuna-7B-v1.5 (Chiang et al., 2023) and Llama-3.1-8B-Instruct (Meta, 2024) as backbones... We utilize the Llama-3-70B-Instruct (Dubey et al., 2024) as our filter model... we employ the GPT-4o-mini to generate rich and context-sensitive captions... Websites are rendered via Playwright.
Experiment Setup Yes We propose a two-stage training pipeline for our UIX models. Stage 1: GUI Knowledge Learning, where we fine-tune the model on 95% of the MultiUI dataset. Stage 2: Visual Instruction Tuning, using LLaVA data and the remaining 5% of the MultiUI data. We adopted a dynamic high-resolution strategy, dividing the input image into patches and incorporating a downsampled version of the entire image. Table 6 provides details including the Max Res. for each model.
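The 95%/5% split reported under Dataset Splits and Experiment Setup can be sketched as follows. This is a minimal illustration only: the paper does not publish its splitting code, so the function name, the use of a seeded shuffle, and the placeholder sample IDs below are all assumptions.

```python
import random

def split_multiui(samples, stage1_frac=0.95, seed=0):
    """Split samples into a Stage-1 (GUI knowledge learning) portion
    and a Stage-2 (visual instruction tuning) portion.

    Hypothetical helper: the paper reports a 95%/5% split of MultiUI
    but does not describe how the split was drawn.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * stage1_frac)
    return shuffled[:cut], shuffled[cut:]

# Placeholder sample IDs standing in for the 7.3M MultiUI samples.
stage1, stage2 = split_multiui(list(range(1000)))
print(len(stage1), len(stage2))  # 950 50
```

Holding the seed fixed keeps the two stages disjoint across runs, which matters because the Stage-2 instruction-tuning data must not overlap the Stage-1 training portion.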