Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Authors: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We develop BTL-UI, a GUI agent trained via the BTL framework, and extensive experiment results demonstrate that the model achieves competitive performance across multiple GUI benchmarks.
Researcher Affiliation	Industry	Shaojie Zhang Ruoceng Zhang Pei Fu Shaokang Wang Jiahui Yang Xin Du Shiqi Cui Bin Qin Ying Huang Zhenbo Luo Jian Luan Mi LM Plus, Xiaomi Inc EMAIL
Pseudocode	No	The paper describes methods and formulas but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 3.2 "Blink Data Generation" describes a pipeline, but not as pseudocode.
Open Source Code	Yes	Equal contribution; Corresponding author. https://github.com/xiaomi-research/btl-ui
Open Datasets	Yes	As shown in Table 1, we train BTL-UI in a mix of grounding and planning data. Table 1: RFT data for BTL-UI (Category: Source, Size). Grounding: ShowUI-Web [35], 1K; ShowUI-Desktop [35], 1K. Low-Level: AndroidControl [36], 500; GUI-Odyssey [37], 500. High-Level: AndroidControl [36], 500; GUI-Odyssey [37], 500.
Dataset Splits	No	The paper mentions that training data is "sampled from various datasets" and that a "random seed to 2025" is fixed for reproducibility during data sampling and training. However, it does not provide explicit training/test/validation split percentages or counts for the BTL-UI model training itself. While it lists datasets used for training and benchmarks for evaluation, the specific partitioning of these for the BTL-UI's own training process is not detailed.
Hardware Specification	Yes	Justification: We show the computer resources in experimental details (NVIDIA H100 GPUs) in experimental results.
Software Dependencies	No	The paper mentions using the "ms-swift framework [34] for RL training" and models like "Qwen2.5-VL-3B/7B" and "Qwen2.5-VL-32B [5]". However, specific version numbers for these frameworks or any other software libraries (e.g., Python, PyTorch, CUDA) are not provided.
Experiment Setup	Yes	Experimental Setup. We develop the BTL-UI-3B/7B model based on Qwen2.5-VL-3B/7B and adopt the ms-swift framework [34] for RL training. As shown in Table 1, we train BTL-UI in a mix of grounding and planning data. Moreover, we examine the effect of varying the number of Blink ROIs (λ): increasing λ from 1 to 6 steadily improves success rates from 66.6% to 69.2%, after which gains plateau, suggesting an optimal trade-off between annotation complexity and attention coverage. It is observed that from Table 6, as λ increases, the performance is saturated, so the final λ is selected as 5. In the data sampling process, we fix the random seed to 2025 to maintain reproducibility. And the sampled data is further adopted to generate Blink Data, following the pipeline in 3.2. Moreover, BTL-UI adopts the ms-swift [34] framework for RL training. During the training process, we also fix the random seed to 2025 to maintain reproducibility.
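
The seeded data sampling quoted in the Experiment Setup row can be illustrated with a minimal sketch. It assumes the per-source sizes listed in Table 1 and the reported seed of 2025. The `sample_mix` helper and the placeholder pools are hypothetical and are not taken from the paper or its repository; the actual sampling procedure is not described on this page.

```python
import random

SEED = 2025  # seed value reported in the paper

# Sample sizes per (category, source), following Table 1 as quoted above.
TABLE_1_SIZES = {
    ("Grounding", "ShowUI-Web"): 1000,
    ("Grounding", "ShowUI-Desktop"): 1000,
    ("Low-Level", "AndroidControl"): 500,
    ("Low-Level", "GUI-Odyssey"): 500,
    ("High-Level", "AndroidControl"): 500,
    ("High-Level", "GUI-Odyssey"): 500,
}


def sample_mix(pools, sizes, seed=SEED):
    """Hypothetical helper: draw a fixed-size sample from each pool.

    A dedicated Random instance is used so the result depends only on
    the seed and the inputs, not on global random state.
    """
    rng = random.Random(seed)
    mix = {}
    # Iterate in sorted order so the draw sequence is deterministic.
    for key in sorted(sizes):
        mix[key] = rng.sample(pools[key], sizes[key])
    return mix


# Placeholder pools standing in for the real datasets (assumption).
pools = {
    key: [f"{key[1]}/{key[0]}/{i}" for i in range(5000)]
    for key in TABLE_1_SIZES
}

first = sample_mix(pools, TABLE_1_SIZES)
second = sample_mix(pools, TABLE_1_SIZES)
total = sum(len(items) for items in first.values())
```

With the same seed and the same pools, both calls select identical items, which is the property the fixed seed is meant to provide. The total of 4,000 samples follows directly from summing the sizes in Table 1.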