Unified Vision-Language Pre-Training for Image Captioning and VQA

Authors: Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, Jianfeng Gao (pp. 13041-13049)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate VLP in our experiments on both the image captioning and VQA tasks using three challenging benchmarks: COCO Captions (Chen et al. 2015), Flickr30k Captions (Young et al. 2014), and VQA 2.0 dataset (Goyal et al. 2017).
Researcher Affiliation | Collaboration | (1) University of Michigan, (2) Microsoft Research, (3) Microsoft Cloud & AI, (4) Microsoft AI & Research
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
Open Datasets | Yes | We conduct pre-training on the Conceptual Captions (CC) dataset (Sharma et al. 2018) which has around 3 million web-accessible images with associated captions. The datasets for downstream tasks include COCO Captions (Chen et al. 2015), VQA 2.0 (Goyal et al. 2017) and Flickr30k (Young et al. 2014).
Dataset Splits | Yes | For COCO Captions and Flickr30k, we follow Karpathy's split, which gives 113.2k/5k/5k and 29.8k/1k/1k images for train/val/test splits respectively. For VQA 2.0, we split the dataset with the official partition, i.e., 443.8k questions from 82.8k images for training, 214.4k questions from 40.5k images for validation, and report the results on the Test-Standard set through the official evaluation server. (A split-loading sketch follows the table.)
Hardware Specification | Yes | Our VQA models are trained on 2x V100 GPUs, COCO Captions SCST training on 4x Titan Xp GPUs, and all others are on 8x V100 GPUs.
Software Dependencies | No | Our Transformer backbone is the same as BERT-base (Devlin et al. 2018).
Experiment Setup | Yes | We use the same training optimizer as in BERT (Devlin et al. 2018); other training hyperparameters are in Tab. 8, reproduced below (a config sketch follows the table).

Table 8: Model hyper-parameters and training specifications.

Dataset | Batch Size | Learning Rate | # of Epochs | GPUs | Time per Epoch
CC | 64 (x8) | 1e-4 (x8) | 30 | 8x V100 | 5 hr
COCO | 64 (x8) | 3e-5 (x8) | 30 | 8x V100 | 12 min
VQA 2.0 | 64 (x2) | 2e-5 (x2) | 20 | 2x V100 | 32 min
Flickr30k | 64 (x8) | 3e-5 (x8) | 30 | 8x V100 | 3 min
COCO (w/o pre-training) | 64 (x8) | 3e-4 (x8) | 30 | 8x V100 | 12 min
COCO (SCST training) | 16 (x4) | 1e-6 (x4) | 30 | 4x Titan Xp | 3 hr
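The Karpathy split cited in the Dataset Splits row is usually distributed as a single JSON file with a per-image split label. The sketch below is a minimal, hedged example of recovering the 113.2k/5k/5k COCO partition from such a file; the file name dataset_coco.json and the folding of the "restval" portion into train are assumptions based on common usage of that split, not details taken from this page.

```python
import json
from collections import defaultdict

# Minimal sketch: partition COCO images using the Karpathy split JSON.
# Assumes the standard "dataset_coco.json" layout with a per-image "split"
# field; merging "restval" into "train" is the usual convention that yields
# roughly 113.2k training images, but it is an assumption, not stated here.
with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append(img["filename"])

for name in ("train", "val", "test"):
    # Expected roughly 113.2k / 5k / 5k for COCO Captions.
    print(name, len(splits[name]))
```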
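For quick reference, the Table 8 settings can also be collected into a plain configuration mapping. This is only a restatement of the numbers above: reading "(xN)" as per-GPU values replicated over N GPUs is an interpretation, and the key names are illustrative rather than taken from the released VLP code.

```python
# Restatement of Table 8 as a config dict (values copied from the table above).
# Treating batch size and learning rate as per-GPU quantities scaled over the
# "(xN)" GPUs is an assumption; key names are illustrative, not from the code base.
TRAINING_CONFIGS = {
    "CC":                      {"batch_size": 64, "lr": 1e-4, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "5 hr"},
    "COCO":                    {"batch_size": 64, "lr": 3e-5, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "12 min"},
    "VQA 2.0":                 {"batch_size": 64, "lr": 2e-5, "epochs": 20, "gpus": "2x V100",     "time_per_epoch": "32 min"},
    "Flickr30k":               {"batch_size": 64, "lr": 3e-5, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "3 min"},
    "COCO (w/o pre-training)": {"batch_size": 64, "lr": 3e-4, "epochs": 30, "gpus": "8x V100",     "time_per_epoch": "12 min"},
    "COCO (SCST training)":    {"batch_size": 16, "lr": 1e-6, "epochs": 30, "gpus": "4x Titan Xp", "time_per_epoch": "3 hr"},
}

def effective_batch_size(name: str) -> int:
    """Per-GPU batch size times the number of GPUs listed for that run."""
    cfg = TRAINING_CONFIGS[name]
    n_gpus = int(cfg["gpus"].split("x")[0])
    return cfg["batch_size"] * n_gpus

print(effective_batch_size("CC"))  # 64 * 8 = 512 under the per-GPU reading
```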