ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Authors: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, Abhanshu Sharma
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering. In this section, we present the setup we used to conduct our experiments and analyze our findings. |
| Researcher Affiliation | Industry | Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen, Abhanshu Sharma Google DeepMind jdchen@google.com |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm", nor does it present any structured code-like blocks. |
| Open Source Code | No | Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering. These are dataset releases, not the source code for the methodology/model itself. |
| Open Datasets | Yes | We release three evaluation datasets for tasks described in Section 4.2: Screen Annotation, ScreenQA Short, and Complex ScreenQA. These datasets enable the research community to utilize our textual representation and allow for a more comprehensive benchmarking of models for screen-based question answering. Screen Annotation (SA): To evaluate our model's layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset [Deka et al., 2017]. ScreenQA Short (SQA Short): ScreenQA [Hsiao et al., 2022], a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. Complex ScreenQA (Cplx SQA): To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. In addition to these screen-related tasks, our training regimen also incorporates a variety of other image and text data sources: Span corruption on C4 [Xue et al., 2020], VQA CC3M [Sharma et al., 2018], WebLI Alt and OCR text [Kil et al., 2023; Chen et al., 2022] and Chart-to-table translation [Liu et al., 2023]. |
| Dataset Splits | No | The paper mentions modifying the "training set" for Multipage DocVQA but does not provide specific split percentages or sample counts for training, validation, and test sets for all datasets used, particularly for its newly released ones. While it may implicitly use standard splits for established benchmarks, it does not explicitly state them. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or inference, such as GPU models, CPU types, or memory specifications. It only mentions being from Google DeepMind. |
| Software Dependencies | No | The paper mentions various models and architectures (e.g., PaLI, pix2struct, ViT, mT5, UL2, DETR, PaLM 2-S) but does not provide specific version numbers for any software dependencies, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | In the fine-tuning phase, we hold the ViT encoder frozen and fine-tune the language model only. We use 512 as our batch size for fine-tuning. Our text input sequence length is 128 and output sequence length varies depending on individual tasks. When fine-tuning with OCR as additional input, we increase the input sequence length accordingly. We generally find that the model converges within 30k steps. Unless specified otherwise, all experiments are run on the 5B model. |
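
The "Experiment Setup" row above reports a frozen ViT encoder, a trainable language model, a fine-tuning batch size of 512, an input sequence length of 128, and convergence within roughly 30k steps. The snippet below is a minimal PyTorch-style sketch of such a setup; the model class, the `vit` attribute name, the output sequence length, and the optimizer and learning rate are assumptions for illustration, not details from the paper or its released code.

```python
# Hedged sketch of the reported fine-tuning configuration (frozen vision
# encoder, language model fine-tuned only). Not the authors' implementation.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class FinetuneConfig:
    batch_size: int = 512       # reported fine-tuning batch size
    input_seq_len: int = 128    # reported text input sequence length
    output_seq_len: int = 256   # varies per task in the paper; this value is illustrative
    max_steps: int = 30_000     # the paper reports convergence within ~30k steps


def freeze_vision_encoder(model: nn.Module, encoder_attr: str = "vit") -> None:
    """Freeze the vision encoder so only the language model is updated.

    `encoder_attr` is a hypothetical attribute name; adapt it to however the
    vision tower is exposed in your own model implementation.
    """
    encoder = getattr(model, encoder_attr)
    for param in encoder.parameters():
        param.requires_grad = False


def build_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    # Optimize only the parameters left trainable (the language model).
    # The learning rate is an assumption; the paper does not report it here.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```

Freezing the encoder and restricting the optimizer to trainable parameters mirrors the paper's stated choice of fine-tuning the language model only; everything else in the sketch should be adapted to the actual model definition.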