Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MobileUse: A Hierarchical Reflection-Driven GUI Agent for Autonomous Mobile Operation

Authors: Ning Li, Xiangmou Qu, Jiamu Zhou, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, Jun Wang, Weinan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on the AndroidWorld and AndroidLab benchmarks demonstrate that MobileUse establishes new state-of-the-art performance, achieving success rates of 62.9% and 44.2%, respectively. To address these challenges, we propose MobileUse, a GUI agent designed for robust and adaptive mobile task execution. Empirical results show that MobileUse achieves state-of-the-art (SOTA) performance, with success rates of 62.9% and 44.2%, respectively. With comprehensive ablation studies and analyses, we highlight the effectiveness of hierarchical reflection and proactive exploration in solving complex tasks.
Researcher Affiliation	Collaboration	Ning Li1, Xiangmou Qu2, Jiamu Zhou2, Muning Wen1, Kounianhua Du1, Xingyu Lou2, Qiuying Peng2, Jun Wang2, Weinan Zhang1; 1 Shanghai Jiao Tong University, 2 OPPO Research Institute; EMAIL, EMAIL
Pseudocode	No	The paper describes the framework components and their interactions using prose and diagrams (Figures 1 and 2), and lists an action space in Appendix B, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps in a code-like format.
Open Source Code	Yes	To facilitate real-world applications, we release an out-of-the-box toolkit for automated task execution on physical mobile devices, which is available at https://github.com/MadeAgents/mobile-use.
Open Datasets	Yes	Benchmarks. We evaluate the performance of MobileUse on two widely-used mobile benchmarks: AndroidWorld (Rawles et al., 2025) and AndroidLab (Xu et al., 2024a).
Dataset Splits	Yes	Both benchmarks provide controllable Android interaction environments, standardized task initialization procedures, and well-defined automated evaluation processes, ensuring consistency in evaluation.
Hardware Specification	Yes	Our experiments are training-free and utilize two types of computational resources. The first type is the multi-modal large language model service, which we deploy using vLLM (Kwon et al., 2023) on a machine running Ubuntu 20.04 with 8 CPU cores, 100 GB of memory, and 4 A100 GPUs (each with 80 GB of VRAM).
Software Dependencies	No	We use the open-source multimodal language model Qwen2.5-VL-72B-Instruct (Bai et al., 2025) with temperature = 0 for our base model. Our experiments are training-free and utilize two types of computational resources. The first type is the multi-modal large language model service, which we deploy using vLLM (Kwon et al., 2023) on a machine running Ubuntu 20.04. Integrated with Gradio (Abid et al., 2019).
Experiment Setup	Yes	We use the open-source multimodal language model Qwen2.5-VL-72B-Instruct (Bai et al., 2025) with temperature = 0 for our base model. Only when ĉ_t < θ, where θ is a predefined threshold, the Action Reflector is invoked to reflect on the current step. In MobileUse, we choose j = max(0, t − 3) to balance the efficiency. We set the total number of exploration steps to 100 per app, with a time cost of 19.5 seconds per step.
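
The quantities quoted in the Experiment Setup row can be worked through in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The function names are hypothetical and the threshold value 0.5 is a placeholder, since the row does not state θ. The sketch also assumes that reflection is triggered when the confidence score falls below θ, and it reads j as the first step of a three-step history window.

```python
# Constants quoted in the Experiment Setup row.
EXPLORATION_STEPS_PER_APP = 100
SECONDS_PER_EXPLORATION_STEP = 19.5


def should_invoke_action_reflector(confidence: float, threshold: float) -> bool:
    """Gate the Action Reflector on a confidence score.

    Assumption: reflection runs when confidence is below the threshold.
    """
    return confidence < threshold


def history_window_start(t: int, window: int = 3) -> int:
    """Return j = max(0, t - window), clamped so it never goes negative."""
    return max(0, t - window)


def exploration_seconds_per_app(
    steps: int = EXPLORATION_STEPS_PER_APP,
    seconds_per_step: float = SECONDS_PER_EXPLORATION_STEP,
) -> float:
    """Total exploration time per app implied by the quoted settings."""
    return steps * seconds_per_step


if __name__ == "__main__":
    theta = 0.5  # hypothetical value; the table does not report it
    print(should_invoke_action_reflector(0.2, theta))
    print(history_window_start(1), history_window_start(10))
    print(exploration_seconds_per_app() / 60, "minutes per app")
```

Under the quoted settings, exploration costs 100 × 19.5 s = 1950 s, or 32.5 minutes per app.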