ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Authors: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, Abhanshu Sharma
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering. In this section, we present the setup we used to conduct our experiments and analyze our findings. |
| Researcher Affiliation | Industry | Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, Abhanshu Sharma Google DeepMind jdchen@google.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm", nor does it present any structured code-like blocks. |
| Open Source Code | No | Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering. These are dataset releases, not the source code for the methodology/model itself. |
| Open Datasets | Yes | We release three evaluation datasets for tasks described in Section 4.2: Screen Annotation, ScreenQA Short, and Complex ScreenQA. These datasets enable the research community to utilize our textual representation and allow for a more comprehensive benchmarking of models for screen-based question answering. Screen Annotation (SA): To evaluate our model's layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset [Deka et al., 2017]. ScreenQA Short (SQA Short): ScreenQA [Hsiao et al., 2022], a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. Complex ScreenQA (Cplx SQA): To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. In addition to these screen-related tasks, our training regimen also incorporates a variety of other image and text data sources: Span corruption on C4 [Xue et al., 2020], VQA CC3M [Sharma et al., 2018], WebLI Alt and OCR text [Kil et al., 2023; Chen et al., 2022] and Chart-to-table translation [Liu et al., 2023]. |
| Dataset Splits | No | The paper mentions modifying the "training set" for Multipage DocVQA but does not provide specific split percentages or sample counts for training, validation, and test sets for all datasets used, particularly for its newly released ones. While it may implicitly use standard splits for established benchmarks, it does not explicitly state them. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or inference, such as GPU models, CPU types, or memory specifications. It only mentions being from Google DeepMind. |
| Software Dependencies | No | The paper mentions various models and architectures (e.g., PaLI, pix2struct, ViT, mT5, UL2, DETR, PaLM 2-S) but does not provide specific version numbers for any software dependencies, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | In the fine-tuning phase, we hold the ViT encoder frozen and fine-tune the language model only. We use 512 as our batch size for fine-tuning. Our text input sequence length is 128 and output sequence length varies depending on individual tasks. When fine-tuning with OCR as additional input, we increase the input sequence length accordingly. We generally find that the model converges within 30k steps. Unless specified otherwise, all experiments are run on the 5B model. |
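
The "Experiment Setup" row above reports a frozen ViT encoder, a trainable language model, a fine-tuning batch size of 512, an input sequence length of 128, and convergence within roughly 30k steps. The snippet below is a minimal PyTorch-style sketch of such a setup; the model class, the `vit` attribute name, the output sequence length, and the optimizer and learning rate are assumptions for illustration, not details from the paper or its released code.

```python
# Hedged sketch of the reported fine-tuning configuration (frozen vision
# encoder, language model fine-tuned only). Not the authors' implementation.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class FinetuneConfig:
    batch_size: int = 512       # reported fine-tuning batch size
    input_seq_len: int = 128    # reported text input sequence length
    output_seq_len: int = 256   # varies per task in the paper; this value is illustrative
    max_steps: int = 30_000     # the paper reports convergence within ~30k steps


def freeze_vision_encoder(model: nn.Module, encoder_attr: str = "vit") -> None:
    """Freeze the vision encoder so only the language model is updated.

    `encoder_attr` is a hypothetical attribute name; adapt it to however the
    vision tower is exposed in your own model implementation.
    """
    encoder = getattr(model, encoder_attr)
    for param in encoder.parameters():
        param.requires_grad = False


def build_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    # Optimize only the parameters left trainable (the language model).
    # The learning rate is an assumption; the paper does not report it here.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```

Freezing the encoder and restricting the optimizer to trainable parameters mirrors the paper's stated choice of fine-tuning the language model only; everything else in the sketch should be adapted to the actual model definition.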