Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Harnessing Webpage UIs for Text-Rich Visual Understanding
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that training on MultiUI significantly improves model performance in both UI-related and general multimodal tasks. Notably, models trained on MultiUI achieved up to a 48% improvement on VisualWebBench (Liu et al., 2024c) and a 19.1% increase in element accuracy on Mind2Web (Deng et al., 2023). More surprisingly, we observed that this training generalizes to non-UI domains, resulting in improved performance in document understanding (Mathew et al., 2021), OCR (Singh et al., 2019; Liu et al., 2023c), and chart interpretation (Masry et al., 2022) tasks, outperforming even models specialized in these areas. |
| Researcher Affiliation | Academia | Carnegie Mellon University, The Chinese University of Hong Kong, Peking University, University of Waterloo |
| Pseudocode | No | The paper describes the construction pipeline through four stages: (1) raw website data scraping, (2) website curation, (3) task extraction from scraped websites, and (4) instruction construction, but these are described in natural language without structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper describes MultiUI as "an open-source dataset" but does not provide an explicit statement or link for the open-source code of the methodology itself (the UIX model or the data-generation pipeline). |
| Open Datasets | Yes | To facilitate this, we introduce MultiUI, an open-source dataset containing 7.3 million samples spanning 1 million websites and various visual understanding tasks. |
| Dataset Splits | Yes | In this stage, we fine-tune the model on 95% of the MultiUI dataset to enhance its web/UI-related understanding capabilities. ... and the remaining 5% of the MultiUI data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | Yes | We developed UIX using Qwen2-7B-Instruct (Yang et al., 2024) as the primary LLM backbone. We also use Vicuna-7B-v1.5 (Chiang et al., 2023) and Llama-3.1-8B-Instruct (Meta, 2024) as backbones... We utilize the Llama-3-70B-Instruct (Dubey et al., 2024) as our filter model... we employ the GPT-4o-mini to generate rich and context-sensitive captions... Websites are rendered via Playwright. |
| Experiment Setup | Yes | We propose a two-stage training pipeline for our UIX models. Stage 1: GUI Knowledge Learning, where we fine-tune the model on 95% of the MultiUI dataset. Stage 2: Visual Instruction Tuning, using LLaVA data and the remaining 5% of the MultiUI data. We adopted a dynamic high-resolution strategy, dividing the input image into patches and incorporating a downsampled version of the entire image. Table 6 provides details including the Max Res. for each model. |
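The two-stage recipe quoted above (Stage 1 on 95% of MultiUI, Stage 2 on LLaVA data plus the remaining 5%) implies a deterministic split of the corpus. A minimal sketch of such a split is shown below; the function name and seed handling are illustrative assumptions, not taken from the paper's released code.

```python
import random

def split_multiui(samples, stage1_frac=0.95, seed=0):
    """Shuffle samples with a fixed seed, then split them into the
    Stage 1 (GUI knowledge learning) portion and the Stage 2
    (visual instruction tuning) remainder."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]              # leave the caller's list intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * stage1_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: 1,000 placeholder sample IDs -> 950 for Stage 1, 50 for Stage 2.
stage1, stage2_extra = split_multiui(list(range(1000)))
```

In the paper's setup, `stage1` would feed the GUI-knowledge fine-tuning pass, while `stage2_extra` would be mixed with LLaVA instruction data for the second stage.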