Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
Authors: Pei Yang, Hai Ci, Mike Zheng Shou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address these gaps, we introduce macOSWorld (Figure 1), the first comprehensive benchmark for evaluating GUI agents on macOS environments. Our contributions include: ... Comprehensive evaluation of six representative GUI agents, revealing performance tiers, language-specific capabilities, and systematic failure patterns that highlight current limitations and future research directions. The benchmarking results reveal distinct performance tiers, with proprietary computer use agents (CUAs) achieving over 30% success rate while open-source research models struggle at below 5%. |
| Researcher Affiliation | Academia | Pei Yang Hai Ci Mike Zheng Shou Show Lab, National University of Singapore EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Preparation Script: `osascript -e 'tell application "Numbers" to activate' && sleep 5 && osascript -e 'tell application "Numbers" to make new document with properties {document template:template "Personal Budget"}'` ... Evaluation Script: `osascript -e 'tell application "Numbers" to tell table 2 of sheet 3 of document 1 to get value of cell "C6"' 2>/dev/null \| grep -q "true" && osascript -e 'tell application "Numbers" to tell table 2 of sheet 3 of document 1 to get value of cell "C7"' 2>/dev/null \| grep -q "false" && echo "True" \|\| echo "False"` |
| Open Source Code | Yes | Project page: https://macos-world.github.io. (from abstract) and Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please refer to https://macos-world.github.io/. |
| Open Datasets | Yes | To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). Project page: https://macos-world.github.io. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please refer to https://macos-world.github.io/. |
| Dataset Splits | Yes | macOSWorld comprises 202 interactive tasks organized into seven categories, of which 171 are available in five languages. To assess this risk, we introduce a dedicated safety benchmarking subset. ... we form the safety subset by first randomly sampling 29 tasks from the main dataset, and then manually annotating 29 unique dialogs containing deceptive content paraphrased from each corresponding task. |
| Hardware Specification | Yes | To comply with Apple software's EULA, macOSWorld runs in macOS EC2 instances on AWS-hosted dedicated Apple hardware. These hosts are genuine Mac minis with custom firmware that boot macOS from external hardware... All experiments run in a macOS virtual instance at 1024 × 768 pixels (the default for macOS and for Claude CUA [29]), physically on AWS-hosted Mac Minis (model A2348). |
| Software Dependencies | Yes | The macOSWorld environment runs macOS Sequoia 15.2, with at least 30 applications involved in benchmarking. We evaluate six representative GUI agents as baselines: two proprietary computer-use agents (OpenAI Computer-Using Agent [28], computer-use-preview-2025-03-11; Claude Computer-Use Agent [29], claude-3-7-sonnet-20250219 with computer-use-2025-01-24 and token-efficient-tools-2025-02-19 betas), two general VLM-based agents (GPT-4o [46], gpt-4o-2024-08-06; Gemini 2.5 Pro [47], gemini-2.5-pro-preview-03-25), and two community open-source GUI agents (ShowUI 2B [25]; UI-TARS 7B DPO [2], chain-of-thought [48] enabled in Chinese). |
| Experiment Setup | Yes | Benchmark Configurations: All experiments run in a macOS virtual instance at 1024 × 768 pixels (the default for macOS and for Claude CUA [29]), physically on AWS-hosted Mac Minis (model A2348). Except for the Set-of-Mark (SoM) ablation (Table 8), all agents perceive the environment from the most recent 3 screenshots only. Each task is granted up to 15 screenshots or 30 dialog turns, whichever limit is reached first. Agents retain the full conversation history, but prune screenshots to the most recent 3 (with ShowUI [25] being an exception, following its own context format [7]). Tasks are benchmarked in five languages: English (en), Chinese (zh), Arabic (ar), Japanese (jp), and Russian (ru), with both the task prompt and system UI set to the target language. The GPT-4o [46] agent was implemented by prompting the agent each time with T=1 and top_p=0.9, with the following content blocks: `<System Prompt>` `<User Query>` (including screenshots) |
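
The context and budget rules quoted in the Experiment Setup row (keep the full text history, keep only the 3 most recent screenshots, stop at 15 screenshots or 30 dialog turns) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the message format (dicts with `type` and `content` keys) and the function names `prune_screenshots` and `budget_exhausted` are assumptions made for illustration; only the three numeric limits come from the quoted text.

```python
from typing import Dict, List

# Hypothetical message format: {"type": "text" | "screenshot", "content": ...}
Message = Dict[str, str]

# Limits taken from the quoted Experiment Setup text.
KEEP_SCREENSHOTS = 3
MAX_SCREENSHOTS = 15
MAX_DIALOG_TURNS = 30


def prune_screenshots(history: List[Message],
                      keep: int = KEEP_SCREENSHOTS) -> List[Message]:
    """Return a new history that keeps every text message but only the
    `keep` most recent screenshot messages, preserving original order."""
    shot_positions = [i for i, m in enumerate(history)
                      if m["type"] == "screenshot"]
    n_drop = max(len(shot_positions) - keep, 0)
    dropped = set(shot_positions[:n_drop])  # oldest screenshots first
    return [m for i, m in enumerate(history) if i not in dropped]


def budget_exhausted(screenshots_taken: int, dialog_turns: int) -> bool:
    """True once either per-task limit is reached, whichever comes first."""
    return (screenshots_taken >= MAX_SCREENSHOTS
            or dialog_turns >= MAX_DIALOG_TURNS)


if __name__ == "__main__":
    history = [{"type": "text", "content": "task"}]
    history += [{"type": "screenshot", "content": f"s{i}"} for i in range(5)]
    history.append({"type": "text", "content": "done"})
    print([m["content"] for m in prune_screenshots(history)])
```

Pruning returns a new list rather than mutating the input, so the caller can keep the full transcript for logging while sending the pruned one to the model.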