Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

Authors: Gang Li, Yang Li

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

The assessment below lists each reproducibility variable, its result, and the supporting LLM response.

Research Type: Experimental
"Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs." (Section 5, Experiments)

Researcher Affiliation: Industry
Gang Li, Google Research, Mountain View, CA (leebird@google.com); Yang Li, Google Research, Mountain View, CA (liyang@google.com)

Pseudocode: Yes
Appendix C ("PSEUDO CODE") and Listing 1 ("Region Summarizer Pseudo Code").
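
The paper's pseudo code is not reproduced in this report. For orientation only, the sketch below shows a generic cross-attention "region summarizer" in plain NumPy, in which a query derived from a region's bounding-box coordinates attends over ViT patch encodings of the screenshot. The function and parameter names are illustrative assumptions and may differ from the paper's Listing 1.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_summarizer(patch_encodings, bbox, W_q, W_k, W_v):
    """Illustrative single-head cross-attention pooling of a UI region.

    patch_encodings: [num_patches, d_model] ViT outputs for the screenshot.
    bbox: [4] normalized region coordinates (left, top, right, bottom).
    W_q, W_k, W_v: projection matrices (shapes assumed below).
    Returns a [d_model] summary vector for the region of interest.
    """
    query = bbox @ W_q                    # query built from the bounding box
    keys = patch_encodings @ W_k          # [num_patches, d_model]
    values = patch_encodings @ W_v        # [num_patches, d_model]
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)             # attention over screenshot patches
    return weights @ values               # [d_model] region summary

# Toy usage with random inputs.
rng = np.random.default_rng(0)
d = 64
patches = rng.normal(size=(196, d))
bbox = np.array([0.1, 0.2, 0.4, 0.35])
W_q = rng.normal(size=(4, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
summary = region_summarizer(patches, bbox, W_q, W_k, W_v)
print(summary.shape)  # (64,)
```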

Open Source Code: No
The paper does not include an unambiguous statement that the authors are releasing the code for this work, nor does it provide a direct link to such a repository; the links it does provide point to third-party pretrained models used in the work.

Open Datasets: Yes
"First, we use the publicly available C4 corpus (Raffel et al., 2019) which contains a large amount of web pages that can be rendered into screenshots, similar to mobile UI screenshots. We use 80 million web page screenshots for pretraining."
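
For readers checking dataset availability, the snippet below shows one way to stream the public C4 corpus with the Hugging Face datasets library. This access path is an assumption of this report, not one prescribed by the paper, and rendering the pages into the 80 million screenshots used for pretraining is a separate step for which no tooling is released.

```python
# Illustrative only: one common way to access the public C4 corpus.
# The paper does not specify its data pipeline; rendering pages into
# screenshots is out of scope for this snippet.
from datasets import load_dataset

# Stream the English split to avoid downloading hundreds of GB up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["url"])  # page URL; the page text is in example["text"]
    if i >= 2:
        break
```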

Dataset Splits: Yes
"We use the same dataset splits for comparison with these benchmarks." Table 9 (dataset statistics of the four downstream tasks) lists Train/Dev/Test counts, e.g., Widget Captioning: Train 14,878, Dev 1,292, Test 1,265.

Hardware Specification: Yes
"The model with ViT B/16 is trained using 128 Google Cloud TPU v3 cores for 54 hours. The model with ViT L/16 is trained using 256 Google Cloud TPU v3 cores for 86 hours."
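
For scale, the quoted figures correspond to roughly 128 × 54 = 6,912 TPU v3 core-hours for the B/16 model and 256 × 86 = 22,016 core-hours for the L/16 model (straightforward multiplication of the reported core counts and wall-clock times).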

Software Dependencies: No
The paper mentions basing its models on T5 and ViT and links to their GitHub repositories, but it does not provide specific version numbers for these or for other key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).

Experiment Setup: Yes
General: "We use an image resolution [740, 740] for all our experiments. We use 0.1 dropout in all our models. For all the training experiments, we use a batch size of 128 and linear warmup/rsqrt decay for learning rate. During inference decoding, a beam size of 5 is used. The maximum text decoding length is 64 for each screen-object-text tuple."
Pretraining: "We pretrain the Spotlight models with a learning rate of 9e-3 for both B/16 (164K steps) and L/16 models (156K steps), with an initial linear warmup to 10k steps."
Finetuning: "For finetuning, we use a learning rate of 1e-3 and 20k steps for the Command Grounding task and 1e-4 and 10k steps for the other three tasks."
Multi-task learning: "We use a learning rate of 3e-4 for multi-task learning, and train the multi-task Spotlight models for 30k steps. The sampling weights are [3, 2, 15, 1] for the Widget Captioning, Screen Summarization, Command Grounding and Tappability tasks. We use 2 screen-object-text tuples in each example during training."
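
To make the quoted schedule concrete, the sketch below implements one standard reading of "linear warmup / rsqrt decay" (the T5-style reciprocal square-root schedule) together with normalization of the multi-task sampling weights. The exact formulation used by the authors is not given in the quote, so the function shape and constants here are assumptions for illustration.

```python
import math

def warmup_rsqrt_lr(step, base_lr, warmup_steps=10_000):
    """One common reading of "linear warmup / rsqrt decay" (T5-style).

    Ramps linearly to base_lr over warmup_steps, then decays
    proportionally to 1/sqrt(step). The paper does not quote its exact
    formula, so this is an assumed reference implementation.
    """
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)

# Pretraining example: base learning rate 9e-3, 10k warmup steps (as quoted).
for s in [1_000, 10_000, 50_000, 164_000]:
    print(s, round(warmup_rsqrt_lr(s, 9e-3), 5))

# Multi-task sampling weights [3, 2, 15, 1], normalized to probabilities for
# Widget Captioning, Screen Summarization, Command Grounding, and Tappability.
weights = [3, 2, 15, 1]
probs = [w / sum(weights) for w in weights]
print([round(p, 3) for p in probs])  # [0.143, 0.095, 0.714, 0.048]
```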