ScreenAgent: A Vision Language Model-driven Computer Control Agent

Authors: Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang

Venue: IJCAI 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. |
| Researcher Affiliation | Academia | 1 School of Artificial Intelligence, Jilin University; 2 Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China. niurl19@mails.jlu.edu.cn, qiwang@jlu.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent. |
| Open Datasets | Yes | The dataset has 273 complete task sessions, with 203 sessions (3005 screenshots) for training and 70 sessions (898 screenshots) for testing. (The split arithmetic is checked in the sketch below the table.) |
| Dataset Splits | No | The paper describes training and testing splits for its own dataset, but provides no details on a validation split, nor on splits for the other datasets used in fine-tuning. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper does not list specific software dependencies (e.g., library names with version numbers) needed to replicate the experiments. |
| Experiment Setup | No | The paper mentions fine-tuning a model and mixing data across training phases, but does not report experimental setup details such as hyperparameters (e.g., learning rate, batch size, epochs) or optimizer settings. |
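The Open Datasets row quotes the paper's session and screenshot counts for the train/test split. As a quick sanity check, here is a minimal Python sketch that verifies the session total and derives the split ratio; the totals and percentages below are computed here by simple arithmetic and are not figures quoted from the paper.

```python
# Sanity-check the ScreenAgent dataset split sizes reported in the paper.
# Per-split counts are quoted from the paper; the totals and ratios are
# derived values, not numbers the paper itself reports.

train_sessions, test_sessions = 203, 70
train_screens, test_screens = 3005, 898

total_sessions = train_sessions + test_sessions  # 273, matches the paper
total_screens = train_screens + test_screens     # 3903 (derived)

assert total_sessions == 273  # the paper's stated session count

print(f"session split: {train_sessions / total_sessions:.1%} train / "
      f"{test_sessions / total_sessions:.1%} test")   # ~74.4% / ~25.6%
print(f"screenshots: {train_screens} train + {test_screens} test "
      f"= {total_screens} total")
```

Running this confirms the 203/70 session split sums to the 273 sessions the paper reports, roughly a 75/25 train/test ratio.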