Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

Authors: Gang Li, Yang Li

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

The assessment below lists each reproducibility variable, its result, and the supporting LLM response.

Research Type: Experimental
"Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs." (Section 5, Experiments)

Researcher Affiliation: Industry
Gang Li, Google Research, Mountain View, CA (leebird@google.com); Yang Li, Google Research, Mountain View, CA (liyang@google.com)

Pseudocode: Yes
Appendix C ("PSEUDO CODE") and Listing 1 ("Region Summarizer Pseudo Code").
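
The paper's pseudo code is not reproduced in this report. For orientation only, the sketch below shows a generic cross-attention "region summarizer" in plain NumPy, in which a query derived from a region's bounding-box coordinates attends over ViT patch encodings of the screenshot. The function and parameter names are illustrative assumptions and may differ from the paper's Listing 1.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_summarizer(patch_encodings, bbox, W_q, W_k, W_v):
    """Illustrative single-head cross-attention pooling of a UI region.

    patch_encodings: [num_patches, d_model] ViT outputs for the screenshot.
    bbox: [4] normalized region coordinates (left, top, right, bottom).
    W_q, W_k, W_v: projection matrices (shapes assumed below).
    Returns a [d_model] summary vector for the region of interest.
    """
    query = bbox @ W_q                    # query built from the bounding box
    keys = patch_encodings @ W_k          # [num_patches, d_model]
    values = patch_encodings @ W_v        # [num_patches, d_model]
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)             # attention over screenshot patches
    return weights @ values               # [d_model] region summary

# Toy usage with random inputs.
rng = np.random.default_rng(0)
d = 64
patches = rng.normal(size=(196, d))
bbox = np.array([0.1, 0.2, 0.4, 0.35])
W_q = rng.normal(size=(4, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
summary = region_summarizer(patches, bbox, W_q, W_k, W_v)
print(summary.shape)  # (64,)
```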

Open Source Code: No
The paper does not include an unambiguous statement that the authors are releasing the code for this work, nor does it provide a direct link to such a repository; the links it does provide point to third-party pretrained models used in the work.

Open Datasets: Yes
"First, we use the publicly available C4 corpus (Raffel et al., 2019) which contains a large amount of web pages that can be rendered into screenshots, similar to mobile UI screenshots. We use 80 million web page screenshots for pretraining."
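
For readers checking dataset availability, the snippet below shows one way to stream the public C4 corpus with the Hugging Face datasets library. This access path is an assumption of this report, not one prescribed by the paper, and rendering the pages into the 80 million screenshots used for pretraining is a separate step for which no tooling is released.

```python
# Illustrative only: one common way to access the public C4 corpus.
# The paper does not specify its data pipeline; rendering pages into
# screenshots is out of scope for this snippet.
from datasets import load_dataset

# Stream the English split to avoid downloading hundreds of GB up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["url"])  # page URL; the page text is in example["text"]
    if i >= 2:
        break
```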

Dataset Splits: Yes
"We use the same dataset splits for comparison with these benchmarks." Table 9 (dataset statistics of the four downstream tasks) lists Train/Dev/Test counts, e.g., Widget Captioning: Train 14,878, Dev 1,292, Test 1,265.

Hardware Specification: Yes
"The model with ViT B/16 is trained using 128 Google Cloud TPU v3 cores for 54 hours. The model with ViT L/16 is trained using 256 Google Cloud TPU v3 cores for 86 hours."
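
For scale, the quoted figures correspond to roughly 128 × 54 = 6,912 TPU v3 core-hours for the B/16 model and 256 × 86 = 22,016 core-hours for the L/16 model (straightforward multiplication of the reported core counts and wall-clock times).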

Software Dependencies: No
The paper mentions basing its models on T5 and ViT and links to their GitHub repositories, but it does not provide specific version numbers for these or for other key software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).

Experiment Setup: Yes
General: "We use an image resolution [740, 740] for all our experiments. We use 0.1 dropout in all our models. For all the training experiments, we use a batch size of 128 and linear warmup/rsqrt decay for learning rate. During inference decoding, a beam size of 5 is used. The maximum text decoding length is 64 for each screen-object-text tuple."
Pretraining: "We pretrain the Spotlight models with a learning rate of 9e-3 for both B/16 (164K steps) and L/16 models (156K steps), with an initial linear warmup to 10k steps."
Finetuning: "For finetuning, we use a learning rate of 1e-3 and 20k steps for the Command Grounding task and 1e-4 and 10k steps for the other three tasks."
Multi-task learning: "We use a learning rate of 3e-4 for multi-task learning, and train the multi-task Spotlight models for 30k steps. The sampling weights are [3, 2, 15, 1] for the Widget Captioning, Screen Summarization, Command Grounding and Tappability tasks. We use 2 screen-object-text tuples in each example during training."
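
To make the quoted schedule concrete, the sketch below implements one standard reading of "linear warmup / rsqrt decay" (the T5-style reciprocal square-root schedule) together with normalization of the multi-task sampling weights. The exact formulation used by the authors is not given in the quote, so the function shape and constants here are assumptions for illustration.

```python
import math

def warmup_rsqrt_lr(step, base_lr, warmup_steps=10_000):
    """One common reading of "linear warmup / rsqrt decay" (T5-style).

    Ramps linearly to base_lr over warmup_steps, then decays
    proportionally to 1/sqrt(step). The paper does not quote its exact
    formula, so this is an assumed reference implementation.
    """
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)

# Pretraining example: base learning rate 9e-3, 10k warmup steps (as quoted).
for s in [1_000, 10_000, 50_000, 164_000]:
    print(s, round(warmup_rsqrt_lr(s, 9e-3), 5))

# Multi-task sampling weights [3, 2, 15, 1], normalized to probabilities for
# Widget Captioning, Screen Summarization, Command Grounding, and Tappability.
weights = [3, 2, 15, 1]
probs = [w / sum(weights) for w in weights]
print([round(p, 3) for p in probs])  # [0.143, 0.095, 0.714, 0.048]
```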