Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent.
Researcher Affiliation	Collaboration	1Beijing Jiaotong University 2Alibaba Group
Pseudocode	No	The paper describes the system architecture and agent functionalities in detail but does not include any pseudocode or algorithm blocks.
Open Source Code	Yes	The code is open-sourced at https://github.com/X-PLUG/Mobile Agent.
Open Datasets	No	The paper states: 'We select 5 system apps and 5 popular external apps for evaluation. For each app, we devise two basic instructions and two advanced instructions... In total, there were 88 instructions for non-English and English scenarios... The apps and instructions used for evaluation in non-English and English scenarios are presented in the appendix.' This refers to a custom-designed set of evaluation tasks, not a publicly available dataset with a link, DOI, or formal citation.
Dataset Splits	No	The paper describes a 'dynamic evaluation method' and a set of instructions used for evaluation but does not specify a division of these instructions into explicit training, validation, and test splits for model development or evaluation, as the MLLMs used are pre-trained via API calls.
Hardware Specification	No	For the MLLMs, the paper states: 'All calls are made through the official API method provided by the developers.', implying the use of cloud-based APIs without specifying the underlying hardware. No other specific hardware details (e.g., GPU models, CPU types, memory) for running experiments are provided.
Software Dependencies	No	The paper mentions specific MLLMs (GPT-4, GPT-4V, Gemini-1.5-Pro, Qwen-VL-Max) and tools (Conv Next Vi T-document, Grounding DINO, Qwen-VL-Int4), but it does not provide specific version numbers for general ancillary software components (e.g., Python, PyTorch, TensorFlow, specific libraries beyond the models themselves).
Experiment Setup	Yes	We fix the seed for GPT-4V invocation and set the temperature to 0 to avoid randomness.