Image Captioners Are Scalable Vision Learners Too
Authors: Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. |
| Researcher Affiliation | Industry | Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer (Google DeepMind) |
| Pseudocode | No | The paper includes diagrams of model architectures (Figure 1) but no pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/google-research/big_vision. |
| Open Datasets | Yes | We use a subset of the WebLI dataset [6] which contains 10B images and 12B multilingual alt-texts. ... Please refer to [6, Sec 2.2] for more details on the WebLI dataset and to [6, Appendix B] for a datasheet. ... We also train some of our models and baselines on the smaller, publicly available LAION-400M dataset [55]. |
| Dataset Splits | Yes | For each result shown in Fig. 3, we select the best setting using 1% of the training data that was held-out for this purpose, and report its accuracy on the 50 000 images in the validation set. ... For each dataset, we either use a provided held-out validation set for selecting the best settings, or hold out 20% of the training set if none is provided. |
| Hardware Specification | Yes | Table 1: Parameter count and TPUv4-hrs. per bn. examples seen. |
| Software Dependencies | No | The paper mentions software components and frameworks such as a SentencePiece model and refers to 'existing transformer codebases [52, 47, 57]' (T5, fairseq, Megatron-LM), but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We use a batch size of 8k for our captioning models... and both 8k and 16k for our retrained CLIP baselines... Models are trained on up to 9B image/alt-text pairs... We use the AdaFactor variant from [68] with a cosine schedule (with 10k warmup steps), and set learning rate and decay factor to 10⁻³ and 10⁻⁴, respectively. Images are resized to a resolution of 224×224, and alt-texts are tokenized to a 32k-sized vocabulary... with a maximum sequence length of 64. (See the configuration sketch below the table.) |
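
For orientation, the hyperparameters quoted in the Experiment Setup row translate into roughly the following training configuration. This is a minimal, illustrative sketch assuming big_vision-style `ml_collections` conventions; the field names (`input`, `optimizer`, `schedule`, etc.) and the `get_config()` helper are assumptions for readability, not the authors' released configuration.

```python
# Hypothetical config sketch based only on the hyperparameters quoted above.
# Field names follow common big_vision / ml_collections conventions but are
# assumptions, not the paper's exact configuration files.
import ml_collections


def get_config():
    config = ml_collections.ConfigDict()

    # Data: WebLI subset (or LAION-400M for public baselines), 224x224 images,
    # alt-texts tokenized to a 32k vocabulary, max sequence length 64.
    config.input = ml_collections.ConfigDict()
    config.input.batch_size = 8_192          # 8k for captioning; 8k/16k for CLIP baselines
    config.input.image_size = (224, 224)
    config.input.vocab_size = 32_000
    config.input.max_text_len = 64

    # Optimization: AdaFactor variant from [68], cosine schedule, 10k warmup steps.
    config.optimizer = 'adafactor'
    config.lr = 1e-3                         # learning rate quoted in the paper
    config.decay_factor = 1e-4               # decay factor quoted in the paper
    config.schedule = dict(type='cosine', warmup_steps=10_000)

    # Training horizon: up to 9B image/alt-text pairs seen.
    config.total_examples_seen = 9_000_000_000

    return config
```

In the actual big_vision repository linked above, such a config would be passed to the training launcher; the sketch only collects the quoted values in one place for reference.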