SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Authors: Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
Researcher Affiliation | Collaboration | Zirui Wang (1,2), Jiahui Yu (2), Adams Wei Yu (2), Zihang Dai (2), Yulia Tsvetkov (3), Yuan Cao (2); (1) Carnegie Mellon University, ziruiw@cs.cmu.edu; (2) Google Research, Brain Team, {jiahuiyu,adamsyuwei,zihangd,yuancao}@google.com; (3) University of Washington, yuliats@cs.washington.edu
Pseudocode | No | The paper describes models and training procedures in text and figures, but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing source code or links to a code repository.
Open Datasets | Yes | All models are pretrained from scratch for about 1M steps on the training set of ALIGN (Jia et al., 2021) and the Colossal Clean Crawled Corpus (C4) dataset presented in Raffel et al. (2019).
Dataset Splits | Yes | We use the AdamW optimizer with the same beta values, while we tune the learning rate in {1×10⁻⁵, 2×10⁻⁵, 5×10⁻⁵}. We also enable regularization methods of Dropout (set to 0.1) and stochastic depth (only applied to the Conv stage and encoder with a fixed dropout rate of 0.1) (Huang et al., 2016) during the finetuning stage. Following standard practice, we use the corresponding dev split to find the best setting and report the result on the test split. (This dev-split sweep is sketched after the table.)
Hardware Specification | Yes | We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). (Per-chip batch arithmetic is sketched after the table.)
Software Dependencies | No | Our models are implemented with the Lingvo framework (Shen et al., 2019).
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.999 and weight decay of 0.01. We warm up the learning rate for the first 2% of updates to a peak value of 5×10⁻⁴, and then linearly decay it afterwards. Dropout is not used during the pretraining stage. We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). (The learning-rate schedule is sketched after the table.)
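The "Experiment Setup" row describes the pretraining optimizer and learning-rate schedule: linear warmup over the first 2% of updates to a peak of 5×10⁻⁴, then linear decay. The following is a minimal sketch of that schedule, assuming the roughly 1M pretraining steps quoted in the "Open Datasets" row and decay to zero at the final step; the function and constant names are illustrative, not from the paper.

```python
# Sketch of the pretraining LR schedule: linear warmup for 2% of updates,
# then linear decay. Step count and decay-to-zero endpoint are assumptions.

TOTAL_STEPS = 1_000_000                      # "about 1M steps" of pretraining
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)       # first 2% of updates
PEAK_LR = 5e-4                               # peak learning rate from the paper

def learning_rate(step: int) -> float:
    """Learning rate at a given update step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to zero at the final step (assumed).
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(remaining, 0.0)

# AdamW hyperparameters quoted in the table (decoupled weight decay).
ADAMW_CONFIG = dict(beta1=0.9, beta2=0.999, weight_decay=0.01)

if __name__ == "__main__":
    for s in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {s:>9}: lr = {learning_rate(s):.6f}")
```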
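The "Dataset Splits" row implies a small finetuning sweep: the learning rate is tuned over {1×10⁻⁵, 2×10⁻⁵, 5×10⁻⁵} on the dev split with dropout 0.1 and stochastic depth 0.1, and the best setting is then reported on the test split. Below is a hedged sketch of that selection loop; finetune_and_evaluate is a hypothetical placeholder for task-specific training code, not an API from the paper.

```python
# Sketch of the dev-split model selection described in the "Dataset Splits"
# row. Only the learning-rate grid and regularization values come from the
# paper; the callable interface is an assumption.

from typing import Callable, Dict, Tuple

LEARNING_RATES = (1e-5, 2e-5, 5e-5)
REGULARIZATION = {"dropout": 0.1, "stochastic_depth": 0.1}

def select_best_config(
    finetune_and_evaluate: Callable[[float, Dict[str, float]], float],
) -> Tuple[float, float]:
    """Return (best_lr, best_dev_score) after sweeping the learning rates."""
    best_lr, best_score = 0.0, float("-inf")
    for lr in LEARNING_RATES:
        dev_score = finetune_and_evaluate(lr, REGULARIZATION)  # dev-split metric
        if dev_score > best_score:
            best_lr, best_score = lr, dev_score
    return best_lr, best_score

if __name__ == "__main__":
    # Dummy stand-in evaluator, purely to show the call pattern.
    dummy = lambda lr, reg: 70.0 + 1e4 * lr
    print(select_best_config(dummy))
```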
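For the "Hardware Specification" row, a quick back-of-envelope check shows what the global batch looks like per device, assuming the 4,096 image-text pairs and 512 text-only documents are sharded evenly across the 512 TPU v3 chips (the even split is an assumption; the paper only states the global figures).

```python
# Per-chip batch arithmetic under an assumed even sharding.
NUM_CHIPS = 512
GLOBAL_IMAGE_TEXT_PAIRS = 4096
GLOBAL_TEXT_DOCS = 512

per_chip_pairs = GLOBAL_IMAGE_TEXT_PAIRS // NUM_CHIPS  # -> 8 image-text pairs
per_chip_docs = GLOBAL_TEXT_DOCS // NUM_CHIPS          # -> 1 text-only document
print(per_chip_pairs, per_chip_docs)
```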