Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Authors: Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. ... For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. Experiments on four domains and nine tasks show that our finetuned models strongly outperform Donut... |
| Researcher Affiliation | Collaboration | ¹Google Research, ²succinctly.ai, ³University of Cambridge. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | For pretrained checkpoints and code, see https://github.com/google-research/pix2struct. |
| Open Datasets | Yes | We train two variants with 282M and 1.3B parameters... on 80M screenshots of web pages collected from the URLs in the C4 corpus (Raffel et al., 2020)... We create images of text snippets from the Books Corpus (Zhu et al., 2015)... |
| Dataset Splits | Yes | As a reference point, the base model reaches 30 BLEU and the large model reaches 32 BLEU on the pretraining validation set. ... We set aside 1% of the train split for validation. ... We finetune for 5000 or 10000 steps with a batch size of 32, 128, or 256, with hyperparameter tuning and early stopping based on the validation set. |
| Hardware Specification | Yes | The base model is then pretrained further for 270K steps with the screenshot parsing objective using a batch size of 2048 on 64 Google Cloud TPUs. The large model is pretrained for 170K steps with a batch size of 1024 on 128 Google Cloud TPUs. ... On the right of Figure 5, we also present example inference speeds on a v3-8 Cloud TPU when performing inference on DocVQA. |
| Software Dependencies | No | The paper mentions model architectures such as ViT and T5 and the Adafactor optimizer, but does not provide version numbers for any software libraries or frameworks (e.g., TensorFlow, PyTorch, or other Python dependencies). |
| Experiment Setup | Yes | Both models use an input sequence length of 2048 patches and are optimized using Adafactor (Shazeer & Stern, 2018). The learning rate schedule uses a linear warmup of 1000 steps to 0.01, followed by cosine decay to 0. The decoder sequence length is 128 tokens, and we choose pretraining targets to have at most 1024 characters. ... Table 5 contains hyperparameter values for all tasks. |
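The optimization recipe quoted in the Experiment Setup row (Adafactor with a linear warmup of 1,000 steps to a peak learning rate of 0.01, followed by cosine decay to 0) can be written down concretely. The sketch below is a minimal illustration assuming an Optax-based setup, which the paper does not specify; the decay horizon is taken to be the base model's 270K screenshot-parsing pretraining steps, which is likewise an assumption about how the schedule maps onto training length.

```python
import optax

# Sketch of the pretraining optimizer described in the Experiment Setup row:
# Adafactor (Shazeer & Stern, 2018) with linear warmup over 1,000 steps to a
# peak learning rate of 0.01, then cosine decay to 0.
# Assumptions: Optax is used (not stated in the paper) and the decay horizon
# equals the base model's 270K pretraining steps.

TOTAL_STEPS = 270_000   # assumed total schedule length (base-model pretraining)
WARMUP_STEPS = 1_000
PEAK_LR = 0.01

# `decay_steps` is the total schedule length, including the warmup phase.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=PEAK_LR,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=0.0,
)

optimizer = optax.adafactor(learning_rate=schedule)

# The schedule can be queried at any step, e.g. at the end of warmup:
print(float(schedule(WARMUP_STEPS)))  # ~0.01
```

Calling `schedule(step)` returns the learning rate used at that step, so the same object can drive logging or be passed directly to `optax.adafactor` as shown above.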