Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

Authors: Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement.
Researcher Affiliation	Collaboration	Han Xiao1,2, Guozhi Wang2, Yuxiang Chai1,2, Zimu Lu1, Weifeng Lin1,2, Hao He1, Lue Fan1, Liuyang Bian2, Rui Hu2, Liang Liu2, Shuai Ren2 B, Yafei Wen2, Xiaoxin Chen2, Aojun Zhou1 B, Hongsheng Li1,3,4 B; 1 CUHK MMLab; 2 vivo AI Lab; 3 Shanghai AI Laboratory; 4 Ace Robotics
Pseudocode	No	The paper describes the methods and processes through textual descriptions and diagrams (e.g., Figure 2, Figure 3) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code	Yes	We open-source our complete framework implementation and generated datasets to facilitate further research in https://github.com/Euphoria16/UI-Genie.
Open Datasets	Yes	We create and open-source two novel datasets (UI-Genie-RM-517k and UI-Genie-Agent-16k) along with our complete framework implementation, establishing the first reward-specific dataset for GUI agents and demonstrating synthetic trajectory generation without manual annotation. We open-source our complete framework implementation and generated datasets to facilitate further research in https://github.com/Euphoria16/UI-Genie.
Dataset Splits	Yes	We conduct an evaluation of UI-Genie-RM using a custom benchmark, since there is no established standard benchmark for GUI agent reward models. Our benchmark derives from test sets of three open-source datasets: Android Control [15], AMEX [5], and Android Lab [44]. For step-level evaluation, we sample 200 distinct ground truth actions as positive examples from each dataset, pairing each with a corresponding negative action generated by the agent model and verified through rule-based methods. For outcome-level evaluation, we include 200 ground truth trajectories as positive samples, complemented by an equal number of negative trajectories created through controlled trajectory corruption for Android Control and AMEX. We further augment this with 100 additional trajectories (50 successful, 50 failed) generated during Android Lab dynamic testing and validated using predefined rules.
Hardware Specification	Yes	After each cycle, we train the model using the AdamW optimizer with a learning rate of 1e-5 and a global batch size of 160 on 20 L40s machines with 4 GPUs each.
Software Dependencies	No	The paper mentions using 'Qwen2.5-VL family of models' as the backbone and 'AdamW optimizer' but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We train the model using the AdamW optimizer with a learning rate of 1e-5 and a global batch size of 160.
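
The hardware and setup rows above imply a simple piece of arithmetic: 20 machines with 4 GPUs each gives 80 GPUs, so a global batch size of 160 corresponds to 2 samples per GPU per optimizer step if no gradient accumulation is used. The quoted text does not state the per-GPU batch size or whether accumulation was used, so the sketch below treats both as assumptions. The function name `per_gpu_batch` and the `grad_accum_steps` parameter are hypothetical and are not taken from the paper or its repository.

```python
# Illustrative arithmetic only. The values below come from the table rows above;
# the per-GPU batch size and gradient accumulation setting are NOT stated in the
# quoted text and are assumptions made for this sketch.
MACHINES = 20            # "20 L40s machines"
GPUS_PER_MACHINE = 4     # "with 4 GPUs each"
GLOBAL_BATCH_SIZE = 160  # "global batch size of 160"
LEARNING_RATE = 1e-5     # "learning rate of 1e-5" (AdamW)


def per_gpu_batch(global_batch, machines, gpus_per_machine, grad_accum_steps=1):
    """Return the per-GPU micro-batch size implied by a global batch size.

    grad_accum_steps is a hypothetical parameter: with accumulation, each GPU
    processes several micro-batches before one optimizer step.
    """
    world_size = machines * gpus_per_machine
    denominator = world_size * grad_accum_steps
    if global_batch % denominator != 0:
        raise ValueError("global batch is not divisible by GPUs x accumulation steps")
    return global_batch // denominator


world_size = MACHINES * GPUS_PER_MACHINE
micro_batch = per_gpu_batch(GLOBAL_BATCH_SIZE, MACHINES, GPUS_PER_MACHINE)
print(f"GPUs: {world_size}, per-GPU batch (assuming no accumulation): {micro_batch}")
```

Anyone reproducing the run on fewer GPUs could keep the global batch size at 160 by raising the accumulation steps accordingly, though whether that matches the original training dynamics is not something the quoted text confirms.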