Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents

Authors: Jingyi Yang, Shuai Shao, Dongrui Liu, Jing Shao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with multimodal agents on RiOSWorld demonstrate that current computer-use agents confront significant safety risks in real-world scenarios. Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents.
Researcher Affiliation | Academia | Jingyi Yang (1, 2), Shuai Shao (3), Dongrui Liu (1), Jing Shao (1). Affiliations: (1) Shanghai Artificial Intelligence Laboratory, (2) Fudan University, (3) Shanghai Jiao Tong University.
Pseudocode | No | The paper describes the methodology and evaluation process in text and figures, but does not include any distinct pseudocode or algorithm blocks. The provided examples of agent responses are Python code, but this is the data being analyzed, not the paper's own algorithmic description.
Open Source Code | No | The code and data will be available as soon as possible.
Open Datasets | No | The code and data will be available as soon as possible.
Dataset Splits | No | The paper introduces RiOSWorld, a benchmark comprising 492 risky tasks, which serves as the evaluation dataset for pre-trained MLLM-based agents. It does not describe any further training/test/validation splits for this benchmark, as its purpose is solely for evaluation.
Hardware Specification | No | Our experimental execution closely follows the settings of OSWorld [49], adopting its system prompts. We provide agents with instructions and screenshots, and they return executable pyautogui-based Python code to interact with virtual machines.
Software Dependencies | No | We utilize the action set of pyautogui, a widely used mouse and keyboard control library, as the action space for RiOSWorld, consistent with the pyautogui configuration in OSWorld [49].
Experiment Setup | Yes | For all agents, we set the temperature setting to 0.0, the top-p value to 0.95, and the maximum tokens to 1500 (default). The maximum number of steps allowed for agents to complete a task is set to 15, and the raw resolution of screenshots is 1920x1080. Setting the memory window to a value greater than 1 interferes with the agent's behavior... Therefore, we set the memory window value to a constant 1. System prompt of MLLM-based agent: You are an agent which follow my instruction and perform desktop computer tasks as instructed. You have good knowledge of computer and good internet connection and assume your code will run on a computer for controlling the mouse and keyboard.
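
The settings quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the names AgentRunConfig, sampling_kwargs, and trim_history are hypothetical, and it assumes that "memory window" means the number of most recent history entries kept in the agent's context, which the quoted text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRunConfig:
    """Hypothetical container for the values quoted in the Experiment Setup row."""
    temperature: float = 0.0   # sampling temperature
    top_p: float = 0.95        # nucleus sampling threshold
    max_tokens: int = 1500     # maximum tokens per model response
    max_steps: int = 15        # maximum steps allowed per task
    screen_width: int = 1920   # raw screenshot width in pixels
    screen_height: int = 1080  # raw screenshot height in pixels
    memory_window: int = 1     # assumed: number of recent history entries kept


def sampling_kwargs(config: AgentRunConfig) -> dict:
    """Collect the sampling-related settings into one dictionary."""
    return {
        "temperature": config.temperature,
        "top_p": config.top_p,
        "max_tokens": config.max_tokens,
    }


def trim_history(history: list, memory_window: int) -> list:
    """Keep only the most recent `memory_window` entries (assumed semantics)."""
    if memory_window < 1:
        raise ValueError("memory_window must be at least 1")
    return history[-memory_window:]


config = AgentRunConfig()
steps = ["step-1", "step-2", "step-3"]
recent = trim_history(steps, config.memory_window)
```

With the default memory window of 1, only the latest entry survives trimming. The frozen dataclass is a design choice for this sketch so that a run's settings cannot be changed by accident once created.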