Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

OpenCUA: Open Foundations for Computer-Use Agents

Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, LI PEIHANG, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Hu Jiarui, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Y.Charles, Zhilin Yang, Tao Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OPENCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models. Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research: https://opencua.xlang.ai
Researcher Affiliation	Collaboration	XLANG Lab, The University of Hong Kong; Moonshot AI; Stanford University; University of Waterloo; Carnegie Mellon University
Pseudocode	No	The paper describes methods and processes in structured text and figures (e.g., Figure 2: Overview of the OPENCUA framework, Figure 4: Reflective long CoT synthesis pipeline), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We release our annotation tool, datasets, code, and models to build open foundations for further CUA research: https://opencua.xlang.ai
Open Datasets	Yes	We then collect the AGENTNET dataset, including 22.6K open-domain computer task trajectories spanning over 100 applications and 200 websites across Windows, macOS, and Ubuntu (Figure 2 top right). ... We release our annotation tool, datasets, code, and models to build open foundations for further CUA research: https://opencua.xlang.ai
Dataset Splits	Yes	To study model generalization, we split data into Windows/macOS and Ubuntu, ensuring no overlap with OSWorld tasks to prevent data leakage. All tasks were manually verified and labeled as rejected, ok, good, or excellent based on goal clarity, diversity, and complexity. ... we curated AGENTNETBENCH based on our collected human demonstrations (Figure 2 bottom right). This offline benchmark provides multiple gold-standard actions per step, efficiently approximating online metrics to dramatically accelerate agent evaluation and development. ... AGENTNETBENCH, comprising 100 representative tasks selected from the AGENTNET dataset.
Hardware Specification	Yes	All models are trained on the Kimi Team’s infrastructure with the Megatron framework and DeepSpeed (ZeRO-3). ... on 96 A100 GPUs. ... 224 A100. ... 128 A100. ... 480 A100.
Software Dependencies	No	The paper mentions several frameworks and models used (e.g., Megatron framework, DeepSpeed (ZeRO-3), claude-3-7-sonnet-20250219, DuckTrack, OpenAdapt, OBS Studio), but does not provide specific version numbers for general software dependencies like Python or PyTorch, which are critical for full reproducibility.
Experiment Setup	Yes	All models are trained on the Kimi Team’s infrastructure with the Megatron framework and DeepSpeed (ZeRO-3). We employ three training strategies: 1. Stage-2 only. OPENCUA-QWEN2-7B and OPENCUA-A3B share a configuration of sequence length 32,768, learning-rate 2 × 10⁻⁵, weight-decay 0.1, and global batch size 384 (512 in ablations) on 96 A100 GPUs. They are trained on 18k Win&macOS + 10k Ubuntu trajectories. OPENCUA-QWEN2-7B runs for 3,400 steps (about 45 h) after a 400-step grounding warm-up; OPENCUA-A3B runs for 2,000 steps (about 10 h).
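
The Stage-2-only hyperparameters quoted in the Experiment Setup row can be collected into a small config object to make the implied quantities explicit. The following is a minimal sketch, not the authors' training code: the class name `Stage2Config` and its field and method names are hypothetical, the step counts are assumed to exclude the 400-step grounding warm-up, and the quoted durations ("about 45 h", "about 10 h") are assumed to be wall-clock time on all 96 GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage2Config:
    """Hypothetical container for the Stage-2-only settings quoted above."""

    name: str
    train_steps: int            # main training steps (warm-up assumed excluded)
    wall_clock_hours: float     # approximate duration as quoted ("about N h")
    seq_len: int = 32_768
    learning_rate: float = 2e-5
    weight_decay: float = 0.1
    global_batch_size: int = 384  # the source reports 512 in ablations
    num_gpus: int = 96            # A100 GPUs

    def sequences_per_gpu_per_step(self) -> float:
        # Global batch divided evenly across GPUs.
        return self.global_batch_size / self.num_gpus

    def total_sequences(self) -> int:
        # Number of training sequences consumed over the whole run.
        return self.global_batch_size * self.train_steps

    def max_tokens(self) -> int:
        # Upper bound only: assumes every sequence fills the full context.
        return self.total_sequences() * self.seq_len

    def gpu_hours(self) -> float:
        # Rough estimate, assuming the quoted hours are wall-clock on all GPUs.
        return self.num_gpus * self.wall_clock_hours


qwen2_7b = Stage2Config("OPENCUA-QWEN2-7B", train_steps=3_400, wall_clock_hours=45)
a3b = Stage2Config("OPENCUA-A3B", train_steps=2_000, wall_clock_hours=10)

for cfg in (qwen2_7b, a3b):
    print(
        f"{cfg.name}: {cfg.sequences_per_gpu_per_step():.0f} seq/GPU/step, "
        f"{cfg.total_sequences():,} sequences, ~{cfg.gpu_hours():,.0f} A100-hours"
    )
```

Under these assumptions, both runs process 4 sequences per GPU per step. The token count from `max_tokens` is an upper bound, since the source does not state how sequences are packed or padded.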