Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Harnessing Webpage UIs for Text-Rich Visual Understanding
Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that training on MultiUI significantly improves model performance in both UI-related and general multimodal tasks. Notably, models trained on MultiUI achieved up to a 48% improvement on VisualWebBench (Liu et al., 2024c) and a 19.1% increase in element accuracy on Mind2Web (Deng et al., 2023). More surprisingly, we observed that this training generalizes to non-UI domains, resulting in improved performance in document understanding (Mathew et al., 2021), OCR (Singh et al., 2019; Liu et al., 2023c), and chart interpretation (Masry et al., 2022) tasks, outperforming even models specialized in these areas. |
| Researcher Affiliation | Academia | Carnegie Mellon University, The Chinese University of Hong Kong, Peking University, University of Waterloo |
| Pseudocode | No | The paper describes the construction pipeline through four stages: (1) raw website data scraping, (2) website curation, (3) task extraction from scraped websites, and (4) instruction construction, but these are described in natural language without structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions "MultiUI, an open-source dataset" but does not explicitly provide a statement or link for open-source code implementing its methodology (the UIX model or the data generation pipeline scripts). |
| Open Datasets | Yes | To facilitate this, we introduce MultiUI, an open-source dataset containing 7.3 million samples spanning 1 million websites and various visual understanding tasks. |
| Dataset Splits | Yes | In this stage, we fine-tune the model on 95% of MultiUI dataset to enhance its web/UI-related understanding capabilities. ... and the remaining 5% of the MultiUI data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | Yes | We developed UIX using Qwen2-7B-Instruct (Yang et al., 2024) as the primary LLM backbone. We also use Vicuna-7B-v1.5 (Chiang et al., 2023) and Llama-3.1-8B-Instruct (Meta, 2024) as backbones... We utilize the Llama-3-70B-Instruct (Dubey et al., 2024) as our filter model... we employ the GPT-4o-mini to generate rich and context-sensitive captions... Websites are rendered via Playwright. |
| Experiment Setup | Yes | We propose a two-stage training pipeline for our UIX models. Stage 1: GUI Knowledge Learning, where we fine-tune the model on 95% of MultiUI dataset. Stage 2: Visual Instruction Tuning, using LLaVA data and the remaining 5% of MultiUI data. We adopted a dynamic high-resolution strategy, dividing the input image into patches and incorporating a downsampled version of the entire image. Table 6 provides details including the Max Res. for each model. |
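
The 95% / 5% division reported under Dataset Splits and Experiment Setup can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_for_two_stage_training`, the random shuffle, and the seed are assumptions made for illustration, since the excerpts above do not say how samples were assigned to each stage.

```python
import random


def split_for_two_stage_training(sample_ids, stage1_fraction=0.95, seed=0):
    """Split sample ids into a Stage 1 portion and a Stage 2 portion.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not details taken from the paper.
    """
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(ids)
    cut = round(len(ids) * stage1_fraction)  # number of Stage 1 samples
    return ids[:cut], ids[cut:]


# Toy example with 1000 placeholder ids standing in for dataset samples.
stage1_ids, stage2_ids = split_for_two_stage_training(range(1000))
```

In this sketch, `stage1_ids` would feed Stage 1 (GUI Knowledge Learning), and `stage2_ids` would be mixed with the LLaVA data for Stage 2 (Visual Instruction Tuning). Recording the seed alongside the split is one way to make such a partition reproducible.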