Unified Vision-Language Pre-Training for Image Captioning and VQA

Authors: Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao (pp. 13041-13049)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate VLP in our experiments on both the image captioning and VQA tasks using three challenging benchmarks: COCO Captions (Chen et al. 2015), Flickr30k Captions (Young et al. 2014), and VQA 2.0 dataset (Goyal et al. 2017).
Researcher Affiliation | Collaboration | (1) University of Michigan, (2) Microsoft Research, (3) Microsoft Cloud & AI, (4) Microsoft AI & Research
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
Open Datasets | Yes | We conduct pre-training on the Conceptual Captions (CC) dataset (Sharma et al. 2018) which has around 3 million web-accessible images with associated captions. The datasets for downstream tasks include COCO Captions (Chen et al. 2015), VQA 2.0 (Goyal et al. 2017) and Flickr30k (Young et al. 2014).
Dataset Splits | Yes | For COCO Captions and Flickr30k, we follow Karpathy's split, which gives 113.2k/5k/5k and 29.8k/1k/1k images for train/val/test splits respectively. For VQA 2.0, we split the dataset with the official partition, i.e., 443.8k questions from 82.8k images for training, 214.4k questions from 40.5k images for validation, and report the results on the Test-Standard set through the official evaluation server. (A split-loading sketch follows the table.)
Hardware Specification | Yes | Our VQA models are trained on 2x V100 GPUs, COCO Captions SCST training on 4x Titan Xp GPUs, and all others are on 8x V100 GPUs.
Software Dependencies | No | Our Transformer backbone is the same as BERT-base (Devlin et al. 2018).
Experiment Setup | Yes | We use the same training optimizer as in BERT (Devlin et al. 2018); other training hyperparameters are in Tab. 8, reproduced below (a config sketch follows the table).

Table 8: Model hyper-parameters and training specifications.

Dataset | Batch Size | Learning Rate | # of Epochs | GPUs | Time per Epoch
CC | 64 (x8) | 1e-4 (x8) | 30 | 8x V100 | 5 hr
COCO | 64 (x8) | 3e-5 (x8) | 30 | 8x V100 | 12 min
VQA 2.0 | 64 (x2) | 2e-5 (x2) | 20 | 2x V100 | 32 min
Flickr30k | 64 (x8) | 3e-5 (x8) | 30 | 8x V100 | 3 min
COCO (w/o pre-training) | 64 (x8) | 3e-4 (x8) | 30 | 8x V100 | 12 min
COCO (SCST training) | 16 (x4) | 1e-6 (x4) | 30 | 4x Titan Xp | 3 hr
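The Karpathy split cited in the Dataset Splits row is usually distributed as a single JSON file with a per-image split label. The sketch below is a minimal, hedged example of recovering the 113.2k/5k/5k COCO partition from such a file; the file name dataset_coco.json and the folding of the "restval" portion into train are assumptions based on common usage of that split, not details taken from this page.

```python
import json
from collections import defaultdict

# Minimal sketch: partition COCO images using the Karpathy split JSON.
# Assumes the standard "dataset_coco.json" layout with a per-image "split"
# field; merging "restval" into "train" is the usual convention that yields
# roughly 113.2k training images, but it is an assumption, not stated here.
with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img["filename"])

for name in ("train", "val", "test"):
    # Expected roughly 113.2k / 5k / 5k for COCO Captions.
    print(name, len(splits[name]))
```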
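For quick reference, the Table 8 settings can also be collected into a plain configuration mapping. This is only a restatement of the numbers above: reading "(xN)" as per-GPU values replicated over N GPUs is an interpretation, and the key names are illustrative rather than taken from the released VLP code.

```python
# Restatement of Table 8 as a config dict (values copied from the table above).
# Treating batch size and learning rate as per-GPU quantities scaled over the
# "(xN)" GPUs is an assumption; key names are illustrative, not from the code base.
TRAINING_CONFIGS = {
    "CC":                      {"batch_size": 64, "lr": 1e-4, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "5 hr"},
    "COCO":                    {"batch_size": 64, "lr": 3e-5, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "12 min"},
    "VQA 2.0":                 {"batch_size": 64, "lr": 2e-5, "epochs": 20, "gpus": "2x V100",     "time_per_epoch": "32 min"},
    "Flickr30k":               {"batch_size": 64, "lr": 3e-5, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "3 min"},
    "COCO (w/o pre-training)": {"batch_size": 64, "lr": 3e-4, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "12 min"},
    "COCO (SCST training)":    {"batch_size": 16, "lr": 1e-6, "epochs": 30, "gpus": "4x Titan Xp", "time_per_epoch": "3 hr"},
}

def effective_batch_size(name: str) -> int:
    """Per-GPU batch size times the number of GPUs listed for that run."""
    cfg = TRAINING_CONFIGS[name]
    n_gpus = int(cfg["gpus"].split("x")[0])
    return cfg["batch_size"] * n_gpus

print(effective_batch_size("CC"))  # 64 * 8 = 512 under the per-GPU reading
```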