Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Harnessing Webpage UIs for Text-Rich Visual Understanding

Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that training on MultiUI significantly improves model performance in both UI-related and general multimodal tasks. Notably, models trained on MultiUI achieved up to a 48% improvement on VisualWebBench (Liu et al., 2024c) and a 19.1% increase in element accuracy on Mind2Web (Deng et al., 2023). More surprisingly, we observed that this training generalizes to non-UI domains, resulting in improved performance in document understanding (Mathew et al., 2021), OCR (Singh et al., 2019; Liu et al., 2023c), and chart interpretation (Masry et al., 2022) tasks, outperforming even models specialized in these areas.
Researcher Affiliation Academia Carnegie Mellon University, The Chinese University of Hong Kong, Peking University, University of Waterloo
Pseudocode No The paper describes the construction pipeline through four stages: (1) raw website data scraping, (2) website curation, (3) task extraction from scraped websites, and (4) instruction construction, but these are described in natural language without structured pseudocode or algorithm blocks.
Open Source Code No The paper mentions "MultiUI, an open-source dataset" but does not explicitly provide a statement or link for the open-source code of its methodology (the UIX model or the data-generation pipeline scripts).
Open Datasets Yes To facilitate this, we introduce MultiUI, an open-source dataset containing 7.3 million samples spanning 1 million websites and various visual understanding tasks.
Dataset Splits Yes In this stage, we fine-tune the model on 95% of the MultiUI dataset to enhance its web/UI-related understanding capabilities. ... and the remaining 5% of the MultiUI data.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies Yes We developed UIX using Qwen2-7B-Instruct (Yang et al., 2024) as the primary LLM backbone. We also use Vicuna-7B-v1.5 (Chiang et al., 2023) and Llama-3.1-8B-Instruct (Meta, 2024) as backbones... We utilize the Llama-3-70B-Instruct (Dubey et al., 2024) as our filter model... we employ the GPT-4o-mini to generate rich and context-sensitive captions... Websites are rendered via Playwright.
Experiment Setup Yes We propose a two-stage training pipeline for our UIX models. Stage 1: GUI Knowledge Learning, where we fine-tune the model on 95% of the MultiUI dataset. Stage 2: Visual Instruction Tuning, using LLaVA data and the remaining 5% of the MultiUI data. We adopted a dynamic high-resolution strategy, dividing the input image into patches and incorporating a downsampled version of the entire image. Table 6 provides details including the Max Res. for each model.
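The 95%/5% split reported under Dataset Splits and Experiment Setup can be sketched as follows. This is a minimal illustration only: the paper does not publish its splitting code, so the function name, the use of a seeded shuffle, and the placeholder sample IDs below are all assumptions.

```python
import random

def split_multiui(samples, stage1_frac=0.95, seed=0):
    """Split samples into a Stage-1 (GUI knowledge learning) portion
    and a Stage-2 (visual instruction tuning) portion.

    Hypothetical helper: the paper reports a 95%/5% split of MultiUI
    but does not describe how the split was drawn.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * stage1_frac)
    return shuffled[:cut], shuffled[cut:]

# Placeholder sample IDs standing in for the 7.3M MultiUI samples.
stage1, stage2 = split_multiui(list(range(1000)))
print(len(stage1), len(stage2))  # 950 50
```

Holding the seed fixed keeps the two stages disjoint across runs, which matters because the Stage-2 instruction-tuning data must not overlap the Stage-1 training portion.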