Image Captioners Are Scalable Vision Learners Too

Authors: Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. (A minimal sketch of the captioning objective appears below the table.)
Researcher Affiliation | Industry | Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer (Google DeepMind)
Pseudocode | No | The paper includes diagrams of model architectures (Figure 1) but no pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/google-research/big_vision.
Open Datasets | Yes | We use a subset of the WebLI dataset [6], which contains 10B images and 12B multilingual alt-texts. ... Please refer to [6, Sec. 2.2] for more details on the WebLI dataset and to [6, Appendix B] for a datasheet. ... We also train some of our models and baselines on the smaller, publicly available LAION-400M dataset [55].
Dataset Splits | Yes | For each result shown in Fig. 3, we select the best setting using 1% of the training data that was held out for this purpose, and report its accuracy on the 50 000 images in the validation set. ... For each dataset, we either use a provided held-out validation set for selecting the best settings, or hold out 20% of the training set if none is provided. (See the split-policy sketch below the table.)
Hardware Specification | Yes | Table 1: Parameter count and TPUv4-hrs. per bn. examples seen.
Software Dependencies | No | The paper mentions software components and frameworks like the 'SentencePiece model' and refers to 'existing transformer code bases [52, 47, 57]' (T5, fairseq, Megatron-LM) but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We use a batch size of 8k for our captioning models... and both 8k and 16k for our retrained CLIP baselines... Models are trained on up to 9B image/alt-text pairs... We use the AdaFactor variant from [68] with a cosine schedule (with 10k warmup steps), and set learning rate and decay factor to 10^-3 and 10^-4, respectively. Images are resized to a resolution of 224x224, and alt-texts are tokenized to a 32k-sized vocabulary... with a maximum sequence length of 64. (A config-dict summary of these settings appears below the table.)
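
As context for the Research Type row, here is a minimal sketch of the captioning pretraining objective that the paper compares against contrastive pretraining: a vision encoder produces image tokens, and a text decoder conditioned on them (via cross-attention) autoregressively predicts the alt-text with a standard next-token cross-entropy loss. This is not the authors' big_vision code; the function name, variable names, shapes, and the use of raw decoder logits are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def captioning_loss(decoder_logits, caption_tokens, pad_id=0):
    """Teacher-forced next-token prediction loss.

    decoder_logits: [batch, seq, vocab] logits from a text decoder that
        cross-attends to the image encoder's output tokens.
    caption_tokens: [batch, seq] tokenized alt-text (targets).
    """
    # Position t predicts token t + 1.
    logits = decoder_logits[:, :-1, :]
    targets = caption_tokens[:, 1:]
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    mask = (targets != pad_id).astype(nll.dtype)  # ignore padded positions
    return (nll * mask).sum() / jnp.maximum(mask.sum(), 1.0)

# Toy example: random logits over a 32k vocabulary with 64-token captions.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (2, 64, 32_000))
tokens = jax.random.randint(key, (2, 64), 1, 32_000)
print(captioning_loss(logits, tokens))
```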
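
The Dataset Splits row describes a simple holdout policy: 1% of the training data is held out for selecting the best transfer setting (with accuracy then reported on the official 50 000-image validation set), and 20% of the training set is held out when a dataset provides no validation split. A rough sketch of that policy; the function name and fixed seed are my own assumptions, not from the paper.

```python
import numpy as np

def holdout_split(num_examples, holdout_fraction, seed=0):
    """Randomly split example indices into (train, held-out-for-selection)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    n_heldout = int(round(holdout_fraction * num_examples))
    return perm[n_heldout:], perm[:n_heldout]

# Hold out 1% of an ImageNet-sized training set for selecting the best
# setting; final accuracy is then reported on the official validation set.
train_idx, selection_idx = holdout_split(1_281_167, holdout_fraction=0.01)

# When a dataset ships no validation split, hold out 20% of training instead.
train_idx, selection_idx = holdout_split(10_000, holdout_fraction=0.20)
```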
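
Finally, the hyperparameters quoted in the Experiment Setup row, collected into a plain-Python config dict for readability. The key names are mine; only the values come from the quoted text, and the "decay factor" entry mirrors the paper's wording (presumably weight decay).

```python
pretraining_config = {
    "batch_size": 8_192,             # 8k for captioning models; CLIP baselines also retrained at 16k
    "examples_seen": 9_000_000_000,  # trained on up to 9B image/alt-text pairs
    "optimizer": "adafactor",        # AdaFactor variant from [68]
    "learning_rate": 1e-3,
    "decay_factor": 1e-4,            # "decay factor" in the quoted setup (presumably weight decay)
    "lr_schedule": "cosine",
    "warmup_steps": 10_000,
    "image_resolution": (224, 224),
    "vocab_size": 32_000,            # SentencePiece-style 32k vocabulary
    "max_text_length": 64,           # maximum alt-text sequence length
}
```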