ScreenAgent: A Vision Language Model-driven Computer Control Agent

Authors: Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang

Venue: IJCAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. |
| Researcher Affiliation | Academia | 1 School of Artificial Intelligence, Jilin University; 2 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China. niurl19@mails.jlu.edu.cn, qiwang@jlu.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent. |
| Open Datasets | Yes | The dataset has 273 complete task sessions, with 203 sessions (3005 screenshots) for training and 70 sessions (898 screenshots) for testing. (The split arithmetic is checked in the sketch below the table.) |
| Dataset Splits | No | The paper describes training and testing splits for its own dataset, but provides no details on a validation split, nor on splits for the other datasets used in fine-tuning. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper does not list specific software dependencies (e.g., library names with version numbers) needed to replicate the experiments. |
| Experiment Setup | No | The paper mentions fine-tuning a model and mixing data across training phases, but does not report experimental setup details such as hyperparameters (e.g., learning rate, batch size, epochs) or optimizer settings. |
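The Open Datasets row quotes the paper's session and screenshot counts for the train/test split. As a quick sanity check, here is a minimal Python sketch that verifies the session total and derives the split ratio; the totals and percentages below are computed here by simple arithmetic and are not figures quoted from the paper.

```python
# Sanity-check the ScreenAgent dataset split sizes reported in the paper.
# Per-split counts are quoted from the paper; the totals and ratios are
# derived values, not numbers the paper itself reports.

train_sessions, test_sessions = 203, 70
train_screens, test_screens = 3005, 898

total_sessions = train_sessions + test_sessions  # 273, matches the paper
total_screens = train_screens + test_screens     # 3903 (derived)

assert total_sessions == 273  # the paper's stated session count

print(f"session split: {train_sessions / total_sessions:.1%} train / "
      f"{test_sessions / total_sessions:.1%} test")   # ~74.4% / ~25.6%
print(f"screenshots: {train_screens} train + {test_screens} test "
      f"= {total_screens} total")
```

Running this confirms the 203/70 session split sums to the 273 sessions the paper reports, roughly a 75/25 train/test ratio.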