Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. |
| Researcher Affiliation | Collaboration | Microsoft; Nanjing University; University of Illinois Urbana-Champaign |
| Pseudocode | Yes | The following Python code outlines the visualization process: starting from the raw attention scores, we normalize and reshape them to match the image dimensions, apply a colormap for clearer interpretation, and finally blend the attention heatmap with the original image. This produces an intuitive overlay that highlights regions the model attends to when making decisions. [...] Listing 1: Python code for overlaying the attention score map on the image. |
| Open Source Code | No | We plan to release all data and code upon passing internal review and approval processes. |
| Open Datasets | Yes | We compile our training data from several publicly available, high-quality GUI datasets. Summary statistics are provided in Table 7. Note that we exclude samples from Wave-UI that overlap with downstream task test sets. Table 7: Overview of training datasets used for GUI-Actor (dataset: # of elements, # of screenshots, platform). Uground Web-Hybrid [8]: 8M, 775K, Web; GUI-Env [23]: 262K, 70K, Web; GUI-Act [23]: 42K, 13K, Web; Android Control [53]: 47K, 47K, Android; AMEX [24]: 1.2M, 100K, Android; Wave-UI: 50K, 7K, Hybrid; Total: 9.6M, 1M, -. |
| Dataset Splits | No | Our full training recipe is built from several public GUI datasets, comprising 1M screenshots. Both GUI-Actor and the two baseline models are trained using the data recipe summarized in Table 7 for 1 epoch. The paper evaluates on separate benchmarks (ScreenSpot, ScreenSpot-v2, ScreenSpot-Pro) but does not provide specific training/validation/test splits for the collective dataset used for training (Table 7). |
| Hardware Specification | No | The paper mentions implementing GUI-Actor using PyTorch and Huggingface Transformers, but does not specify the hardware (e.g., GPU models, CPU models, memory) used for running experiments. |
| Software Dependencies | No | We implement GUI-Actor using PyTorch and Huggingface Transformers. No specific version numbers are provided for these software dependencies. |
| Experiment Setup | Yes | We implement GUI-Actor using PyTorch and Huggingface Transformers. Unless otherwise specified, we adopt Qwen-2-VL-7B-Instruct [38] as the backbone VLM... The number of attention heads in the self-attention layer is set to 8; Both MLP components are two-layer feedforward networks with a GELU activation in between. We use the same dimensionality as the backbone VLM for all configurations of the action head. The grounding verifier is fine-tuned from UI-TARS-2B-SFT [10]. During inference, we construct a pool of K = 20 candidates and apply a confidence threshold of γ = 0.95 for tasks in ScreenSpot-Pro and γ = 0.8 for ScreenSpot and ScreenSpot-v2. ... Both GUI-Actor and the two baseline models are trained using the data recipe summarized in Table 7 for 1 epoch. ... To train GUI-Actor, we begin by freezing all backbone VLM parameters and training only the newly introduced components of the action head (∼20M/∼100M parameters for 2B/7B backbone). After this warm-up phase, we fine-tune the entire model using standard supervised learning. |
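
The Pseudocode row quotes a four-step visualization procedure: normalize the raw attention scores, reshape them to the image dimensions, apply a colormap, and blend the heatmap with the original image. The paper's Listing 1 is not reproduced on this page, so the following is only a minimal NumPy sketch of those steps, not the authors' code. The function name `overlay_attention`, the `grid_hw` and `alpha` parameters, the nearest-neighbour upsampling, and the simple blue-to-red colormap are all assumptions made for illustration.

```python
import numpy as np


def overlay_attention(image, scores, grid_hw, alpha=0.5):
    """Blend a patch-level attention map onto an RGB image (illustrative sketch).

    image:   uint8 array of shape (H, W, 3)
    scores:  flat sequence of grid_h * grid_w raw attention scores
    grid_hw: (grid_h, grid_w) patch grid the scores belong to (assumed layout)
    alpha:   weight of the heatmap in the blend, in [0, 1]
    """
    grid_h, grid_w = grid_hw
    height, width = image.shape[:2]

    # Step 1: min-max normalize to [0, 1]; a constant map becomes all zeros.
    s = np.asarray(scores, dtype=np.float64).reshape(grid_h, grid_w)
    span = s.max() - s.min()
    s = (s - s.min()) / span if span > 0 else np.zeros_like(s)

    # Step 2: upsample to the image size (nearest neighbour, an assumption).
    rows = np.arange(height) * grid_h // height
    cols = np.arange(width) * grid_w // width
    heat = s[rows[:, None], cols[None, :]]

    # Step 3: hypothetical colormap, blue for low scores and red for high.
    color = np.stack([heat, np.zeros_like(heat), 1.0 - heat], axis=-1) * 255.0

    # Step 4: alpha-blend the heatmap with the original image.
    blended = (1.0 - alpha) * image.astype(np.float64) + alpha * color
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    screenshot = np.full((8, 8, 3), 200, dtype=np.uint8)  # dummy gray image
    overlay = overlay_attention(screenshot, np.arange(16.0), (4, 4))
    print(overlay.shape, overlay.dtype)
```

A plotting library's colormap and an interpolating resize would likely give smoother overlays; the dependency-free versions here only keep the sketch self-contained.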