SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Authors: Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). |
| Researcher Affiliation | Collaboration | Zirui Wang (1, 2), Jiahui Yu (2), Adams Wei Yu (2), Zihang Dai (2), Yulia Tsvetkov (3), Yuan Cao (2). 1: Carnegie Mellon University, ziruiw@cs.cmu.edu; 2: Google Research, Brain Team, {jiahuiyu,adamsyuwei,zihangd,yuancao}@google.com; 3: University of Washington, yuliats@cs.washington.edu |
| Pseudocode | No | The paper describes models and training procedures in text and figures, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | All models are pretrained from scratch for about 1M steps on the training set of ALIGN (Jia et al., 2021) and the Colossal Clean Crawled Corpus (C4) dataset presented in Raffel et al. (2019). |
| Dataset Splits | Yes | We use the AdamW optimizer with the same β values, while we tune the learning rate in {1 × 10⁻⁵, 2 × 10⁻⁵, 5 × 10⁻⁵}. We also enable regularization methods of Dropout (set to 0.1) and stochastic depth (only applied to the Conv stage and encoder with a fixed dropout rate of 0.1) (Huang et al., 2016) during the finetuning stage. Following standard practice, we use the corresponding dev split to find the best setting and report the result on the test split. |
| Hardware Specification | Yes | We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). |
| Software Dependencies | No | Our models are implemented with the Lingvo framework (Shen et al., 2019). |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β₁ = 0.9, β₂ = 0.999 and weight decay of 0.01. We warm up the learning rate for the first 2% of updates to a peak value of 5 × 10⁻⁴, and then linearly decay it afterwards. Dropout is not used during the pretraining stage. We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). |
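
The optimizer settings and learning-rate schedule quoted in the Experiment Setup row can be expressed as a minimal Python sketch. This is not the authors' Lingvo code: `TOTAL_STEPS` is an assumption taken from the "about 1M steps" figure quoted in the Open Datasets row, and the linear decay to zero at the final step is assumed rather than stated in the paper.

```python
# Hedged sketch (not the authors' implementation) of the pretraining
# optimizer configuration and learning-rate schedule quoted above.

TOTAL_STEPS = 1_000_000                  # assumption: "about 1M steps" of pretraining
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)   # warm up for the first 2% of updates
PEAK_LR = 5e-4                           # peak learning rate from the quoted setup

ADAMW_CONFIG = {
    "beta1": 0.9,
    "beta2": 0.999,
    "weight_decay": 0.01,
}

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach zero at TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = TOTAL_STEPS - step
    decay_span = TOTAL_STEPS - WARMUP_STEPS
    return max(0.0, PEAK_LR * remaining / decay_span)

# Example: learning_rate(10_000) -> 2.5e-4 (mid-warmup);
# learning_rate(510_000) -> 2.5e-4 (about half of PEAK_LR during decay).
```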
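The within-batch mixing of ALIGN image-text pairs and C4 text-only documents, quoted in the Hardware Specification and Experiment Setup rows, can be sketched as below. `align_iter` and `c4_iter` are hypothetical placeholder iterators over the two sources, and the even per-chip split is an assumption for illustration.

```python
# Hedged sketch of the quoted batch composition: each training batch mixes
# 4,096 image-text pairs (ALIGN) with 512 text-only documents (C4),
# sharded across 512 TPU v3 chips.
import itertools

IMAGE_TEXT_PER_BATCH = 4096   # image-text pairs from ALIGN
TEXT_ONLY_PER_BATCH = 512     # text-only documents from C4
NUM_SHARDS = 512              # TPU v3 chips the batch is sharded across

def mixed_batches(align_iter, c4_iter):
    """Yield batches that combine both pretraining sources at a fixed ratio."""
    while True:
        yield {
            "image_text": list(itertools.islice(align_iter, IMAGE_TEXT_PER_BATCH)),
            "text_only": list(itertools.islice(c4_iter, TEXT_ONLY_PER_BATCH)),
        }

# Assuming an even split across 512 chips, each chip would see
# 4096 / 512 = 8 image-text pairs and 512 / 512 = 1 text-only document per step.
```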