SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Authors: Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
Researcher Affiliation | Collaboration | Zirui Wang (1,2), Jiahui Yu (2), Adams Wei Yu (2), Zihang Dai (2), Yulia Tsvetkov (3), Yuan Cao (2); (1) Carnegie Mellon University, ziruiw@cs.cmu.edu; (2) Google Research, Brain Team, {jiahuiyu,adamsyuwei,zihangd,yuancao}@google.com; (3) University of Washington, yuliats@cs.washington.edu
Pseudocode | No | The paper describes models and training procedures in text and figures, but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing source code or links to a code repository.
Open Datasets | Yes | All models are pretrained from scratch for about 1M steps on the training set of ALIGN (Jia et al., 2021) and the Colossal Clean Crawled Corpus (C4) dataset presented in Raffel et al. (2019).
Dataset Splits | Yes | We use the AdamW optimizer with the same beta values, while we tune the learning rate in {1×10⁻⁵, 2×10⁻⁵, 5×10⁻⁵}. We also enable regularization methods of Dropout (set to 0.1) and stochastic depth (only applied to the Conv stage and encoder with a fixed dropout rate of 0.1) (Huang et al., 2016) during the finetuning stage. Following standard practice, we use the corresponding dev split to find the best setting and report the result on the test split. (This dev-split sweep is sketched after the table.)
Hardware Specification | Yes | We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). (Per-chip batch arithmetic is sketched after the table.)
Software Dependencies | No | Our models are implemented with the Lingvo framework (Shen et al., 2019).
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.999 and weight decay of 0.01. We warm up the learning rate for the first 2% of updates to a peak value of 5×10⁻⁴, and then linearly decay it afterwards. Dropout is not used during the pretraining stage. We mix the two pretraining datasets within each batch, which contains 4,096 image-text pairs and 512 text-only documents, sharded across 512 TPU v3 chips (Jouppi et al., 2017). (The learning-rate schedule is sketched after the table.)
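The "Experiment Setup" row describes the pretraining optimizer and learning-rate schedule: linear warmup over the first 2% of updates to a peak of 5×10⁻⁴, then linear decay. The following is a minimal sketch of that schedule, assuming the roughly 1M pretraining steps quoted in the "Open Datasets" row and decay to zero at the final step; the function and constant names are illustrative, not from the paper.

```python
# Sketch of the pretraining LR schedule: linear warmup for 2% of updates,
# then linear decay. Step count and decay-to-zero endpoint are assumptions.

TOTAL_STEPS = 1_000_000                      # "about 1M steps" of pretraining
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)       # first 2% of updates
PEAK_LR = 5e-4                               # peak learning rate from the paper

def learning_rate(step: int) -> float:
    """Learning rate at a given update step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay from the peak down to zero at the final step (assumed).
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(remaining, 0.0)

# AdamW hyperparameters quoted in the table (decoupled weight decay).
ADAMW_CONFIG = dict(beta1=0.9, beta2=0.999, weight_decay=0.01)

if __name__ == "__main__":
    for s in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {s:>9}: lr = {learning_rate(s):.6f}")
```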
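The "Dataset Splits" row implies a small finetuning sweep: the learning rate is tuned over {1×10⁻⁵, 2×10⁻⁵, 5×10⁻⁵} on the dev split with dropout 0.1 and stochastic depth 0.1, and the best setting is then reported on the test split. Below is a hedged sketch of that selection loop; finetune_and_evaluate is a hypothetical placeholder for task-specific training code, not an API from the paper.

```python
# Sketch of the dev-split model selection described in the "Dataset Splits"
# row. Only the learning-rate grid and regularization values come from the
# paper; the callable interface is an assumption.

from typing import Callable, Dict, Tuple

LEARNING_RATES = (1e-5, 2e-5, 5e-5)
REGULARIZATION = {"dropout": 0.1, "stochastic_depth": 0.1}

def select_best_config(
    finetune_and_evaluate: Callable[[float, Dict[str, float]], float],
) -> Tuple[float, float]:
    """Return (best_lr, best_dev_score) after sweeping the learning rates."""
    best_lr, best_score = 0.0, float("-inf")
    for lr in LEARNING_RATES:
        dev_score = finetune_and_evaluate(lr, REGULARIZATION)  # dev-split metric
        if dev_score > best_score:
            best_lr, best_score = lr, dev_score
    return best_lr, best_score

if __name__ == "__main__":
    # Dummy stand-in evaluator, purely to show the call pattern.
    dummy = lambda lr, reg: 70.0 + 1e4 * lr
    print(select_best_config(dummy))
```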
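For the "Hardware Specification" row, a quick back-of-envelope check shows what the global batch looks like per device, assuming the 4,096 image-text pairs and 512 text-only documents are sharded evenly across the 512 TPU v3 chips (the even split is an assumption; the paper only states the global figures).

```python
# Per-chip batch arithmetic under an assumed even sharding.
NUM_CHIPS = 512
GLOBAL_IMAGE_TEXT_PAIRS = 4096
GLOBAL_TEXT_DOCS = 512

per_chip_pairs = GLOBAL_IMAGE_TEXT_PAIRS // NUM_CHIPS  # -> 8 image-text pairs
per_chip_docs = GLOBAL_TEXT_DOCS // NUM_CHIPS          # -> 1 text-only document
print(per_chip_pairs, per_chip_docs)
```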