Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Enhancing GUI Agent with Uncertainty-Aware Self-Trained Evaluator

Authors: Gongwei Chen, Lirong Jie, Lexiao Zou, Weili Guan, Miao Zhang, Liqiang Nie

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments reveal superior performances of URST in both in-domain (an average gain of 2.45%) and out-of-domain (an average gain of 2.89%) datasets, compared to the state-of-the-art prompting-based methods. This comparison suggests that domain-specific adaptation through self-training can outweigh the general reasoning power of large foundation models. Notably, evaluators (84.13%) trained with self-generated synthetic data exhibit significantly larger performance than those (79.01%) trained on data from proprietary MLLMs, highlighting the potential of self-generated data. We also implement some self-training works and compare them with URST. The results show that URST outperforms these methods by a margin of 6%, thanks to advanced sampling and policy optimization techniques. We also perform GUI agent training with the self-trained evaluator and obtain consistent performance improvements on two navigation datasets. Overall, our findings suggest reinforced self-training as a promising approach to training powerful GUI evaluators.
Researcher Affiliation | Academia | Gongwei Chen, Lirong Jie, Lexiao Zou, Weili Guan, Miao Zhang, Liqiang Nie; Harbin Institute of Technology, Shenzhen
Pseudocode | Yes | The whole procedure is shown in Algorithm 1.
Open Source Code | Yes | https://github.com/JL181818/URST
Open Datasets | Yes | We collect and construct a training set and three test sets for GUI trajectory evaluation. The training set is built on a subset of Android-in-the-Wild (AITW) datasets. As analyzed in [27], about 36% of the human demonstrations in this dataset are actually incorrect. We randomly sample 1500 trajectories from AITW training set, and use Qwen-VL-Max to generate the thoughts and judgments for supervised fine-tuning. In the self-training setting, we only sample 300 trajectories with the thoughts and judgments generated from Qwen-VL-Max as the initial training set. AITW-ID-traj and AITW-OOD-traj are in-domain and out-of-domain test sets built on AITW dataset. Each of these two test sets contains 120 tasks and was manually annotated in [27]. AITW-ID-traj and AITW-OOD-traj share the task goals, but have different trajectory distributions. Following OS-genesis [41], we also collect some agent-executed trajectories from an online environment, Android World. After manual annotation and filtering, we keep 223 trajectories and obtain a new out-of-domain test set, AW-OOD-traj.
Dataset Splits | Yes | We collect and construct a training set and three test sets for GUI trajectory evaluation. The training set is built on a subset of Android-in-the-Wild (AITW) datasets. [...] We randomly sample 1500 trajectories from AITW training set, and use Qwen-VL-Max to generate the thoughts and judgments for supervised fine-tuning. In the self-training setting, we only sample 300 trajectories with the thoughts and judgments generated from Qwen-VL-Max as the initial training set. AITW-ID-traj and AITW-OOD-traj are in-domain and out-of-domain test sets built on AITW dataset. Each of these two test sets contains 120 tasks and was manually annotated in [27]. [...] AW-OOD-traj test set containing 223 trajectories.
Hardware Specification | Yes | All experiments were conducted on 4 NVIDIA A100 40GB GPUs.
Software Dependencies | No | The paper mentions 'DeepSpeed's Zero-3 optimization stage, and flash attention' and uses specific MLLMs ('Qwen-VL-Max', 'Qwen2.5-VL-3B', 'Qwen2-VL-2B'), but does not provide version numbers for these software components or for general programming languages and frameworks such as Python or PyTorch, which would be necessary for full reproducibility. The text 'All experiments are conducted using DeepSpeed's Zero-3 optimization stage, and flash attention is employed to accelerate training' lacks version numbers.
Experiment Setup | Yes | The key hyperparameters used in our experiments are summarized in Table 8. The model is trained for 2 epochs using 300 samples annotated with Qwen-VL-Max during the initialization SFT stage. For SRPO training, each iteration involves 4 epochs of training on 400 samples sampled via URST. Consequently, after 3 iterations, the total number of training samples amounts to 1500. Both initialization and subsequent SGPO training are conducted with full-parameter fine-tuning. In each iteration, the learning rate is warmed up linearly from 0 to 1e-6 across 5 global steps and then reduced to a minimum of 0 using cosine decay. We adopt a β value of 0.001, which balances the reward signal and divergence constraint in the policy update. To manage training efficiency and computational cost, the maximum pixel limit for each visual input was set at 802,816. If an input image exceeds this limit, it is cropped and resized while preserving the original aspect ratio. To further enhance memory efficiency and scalability, all experiments are conducted using DeepSpeed's Zero-3 optimization stage, and flash attention is employed to accelerate training. During both the uncertainty-aware sampling and SGPO training, we apply the temperature set to 1.0, top-k sampling with k = 50, and nucleus (top-p) sampling with p = 0.9 to ensure the diversity of the outputs. Table 8: Hyperparameter settings used in the experiments (hyperparameter: value). SFT training epoch at initialization stage: 2; SRPO training epoch: 4; sample size per iteration: 400; β: 0.001; learning rate: 1e-6; warmup ratio: 0.05; max pixels: 802,816; per device train batch size: 2; DeepSpeed optimization stage: Zero-3; temperature: 1; top-k: 50; top-p: 0.9.
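The learning-rate schedule and sample accounting quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' code. The peak rate (1e-6), the 5 warmup steps, the minimum of 0, and the 300/400/3 sample counts come from the quoted text. The function names and the total of 100 optimizer steps per iteration are assumptions; the step count is only inferred from the listed warmup ratio (5 / 0.05) and is not stated in the source.

```python
import math

PEAK_LR = 1e-6     # from the quoted setup
WARMUP_STEPS = 5   # from the quoted setup
MIN_LR = 0.0       # from the quoted setup


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at optimizer step `step` (0-indexed).

    Linear warmup from 0 to PEAK_LR over WARMUP_STEPS, then cosine
    decay to MIN_LR at `total_steps`. `total_steps` is an assumption
    supplied by the caller, not a value stated in the source.
    """
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed the warmup length")
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def total_samples(initial: int, per_iteration: int, iterations: int) -> int:
    """Initial SFT samples plus the samples added in each self-training iteration."""
    return initial + per_iteration * iterations


if __name__ == "__main__":
    assumed_total_steps = 100  # hypothetical: 5 warmup steps / 0.05 warmup ratio
    for s in (0, 5, 50, 100):
        print(s, lr_at(s, assumed_total_steps))
    print(total_samples(300, 400, 3))
```

The sample arithmetic matches the quoted total: 300 initial samples plus 3 iterations of 400 gives 1500.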