Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Image Captioners Are Scalable Vision Learners Too
Authors: Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. |
| Researcher Affiliation | Industry | Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer; Google DeepMind |
| Pseudocode | No | The paper includes diagrams of model architectures (Figure 1) but no pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/google-research/big_vision. |
| Open Datasets | Yes | We use a subset of the WebLI data set [6] which contains 10B images and 12B multilingual alt-texts. ... Please refer to [6, Sec 2.2] for more details on the WebLI data set and to [6, Appendix B] for a datasheet. ... We also train some of our models and baselines on the smaller, publicly available LAION-400M dataset [55]. |
| Dataset Splits | Yes | For each result shown in Fig. 3, we select the best setting using 1% of the training data that was held-out for this purpose, and report its accuracy on the 50 000 images in the validation set. ... For each dataset, we either use a provided held-out validation set for selecting the best settings, or hold out 20% of the training set if none is provided. |
| Hardware Specification | Yes | Table 1: Parameter count and TPUv4-hrs. per bn. examples seen. |
| Software Dependencies | No | The paper mentions software components and frameworks like 'sentence piece model' and refers to 'existing transformer code bases [52, 47, 57]' (T5, fairseq, Megatron-LM) but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We use a batch size of 8k for our captioning models... and both 8k and 16k for our retrained CLIP baselines... Models are trained on up to 9B image/alt-text pairs... We use the AdaFactor variant from [68] with a cosine schedule (with 10k warmup steps), and set learning rate and decay factor to 10⁻³ and 10⁻⁴, respectively. Images are resized to a resolution of 224×224, and alt-texts are tokenized to a 32k-sized vocabulary... with a maximum sequence length of 64. |
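
The schedule quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. Only the peak learning rate (10⁻³), the 10k warmup steps, the 8k batch size, and the 9B example budget come from the quoted text. The linear warmup shape, the decay to zero, the reading of "8k" as 8192, and the function name `learning_rate` are assumptions made for illustration; this is not the authors' implementation.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 1e-3
WARMUP_STEPS = 10_000
EXAMPLES_SEEN = 9_000_000_000  # upper bound: "up to 9B" pairs

# Assumption: "8k" is read as 8 * 1024 = 8192 examples per step.
BATCH_SIZE = 8 * 1024

# Number of optimizer steps implied by the assumptions above.
TOTAL_STEPS = EXAMPLES_SEEN // BATCH_SIZE


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS):
    """Cosine schedule with warmup (hypothetical helper).

    Assumptions: warmup is linear from 0, and the cosine decays to 0.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from peak_lr down to 0.
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the 10k warmup steps cover under 1% of training, so nearly the entire run is spent in the cosine decay phase.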