ScreenAgent: A Vision Language Model-driven Computer Control Agent
Authors: Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves comparable computer control capabilities to GPT-4V and demonstrated more precise UI positioning capabilities. |
| Researcher Affiliation | Academia | 1 School of Artificial Intelligence, Jilin University 2 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China niurl19@mails.jlu.edu.cn, qiwang@jlu.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent. |
| Open Datasets | Yes | The dataset has 273 complete task sessions, with 203 sessions (3005 screenshots) for training and 70 sessions (898 screenshots) for testing. |
| Dataset Splits | No | The paper explicitly mentions training and testing splits for its dataset but gives no details of a validation split, either for that dataset or for the other datasets used for fine-tuning. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions fine-tuning a model and data mixing for training phases, but it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, epochs) or optimizer settings. |