VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Authors: Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, Furu Wei

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval.
Researcher Affiliation | Collaboration | Hangbo Bao1, Wenhui Wang2, Li Dong2, Qiang Liu2, Owais Khan Mohammed2, Kriti Aggarwal2, Subhojit Som2, Songhao Piao1, Furu Wei2; 1 Harbin Institute of Technology, 2 Microsoft Corporation
Pseudocode | No | The paper does not include any explicit pseudocode or algorithm blocks. It uses diagrams and descriptive text to explain the model architecture.
Open Source Code | Yes | The code and pretrained models are available at http://aka.ms/vlmo.
Open Datasets | Yes | Following previous work [4, 21], our pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22] datasets.
Dataset Splits | Yes | We report vqa-score on VQA test-dev and test-standard split, and report accuracy for NLVR2 development and public test set (test-P).
Hardware Specification | Yes | We perform experiments using 32 V100 GPUs for the base-size model. The batch size per GPU is 32, and the total batch size is 1024.
Software Dependencies | No | The paper mentions software components such as BERT's tokenizer and the AdamW optimizer, but does not provide specific version numbers for any software dependencies such as PyTorch, TensorFlow, or Python libraries.
Experiment Setup | Yes | Our models adopt the same network configuration as ViT [13] and BEiT [3]. VLMo-Base consists of a 12-layer Transformer network with 768 hidden size and 12 attention heads. VLMo-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads. VLMo-Base uses a vision-language expert on the top two Transformer layers, and VLMo-Large introduces a vision-language expert on the top three layers. VLMo-Base consists of 175M parameters and VLMo-Large contains 562M parameters. For images, the input resolution is 224×224 and the patch size is 16×16 during pre-training. We apply RandAugment [10] to the input images. The tokenizer of the uncased version of BERT is employed to tokenize the text. The maximum text sequence length is set to 40. We also employ whole word masking for the masked language modeling pre-training task. We pretrain the models for 200k steps with 1024 batch size. We utilize the AdamW [30] optimizer with β1 = 0.9, β2 = 0.98. The peak learning rate is 2e-4 for the base-size model and 5e-5 for the large-size model. Weight decay is set to 0.01. We use linear warmup over the first 2.5k steps and linear decay. (Hedged sketches of this configuration and the optimization recipe follow the table.)
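
To make the quoted model configuration easier to scan, here is a minimal sketch that restates the base/large settings from the Experiment Setup row as a Python config object. The class and field names are hypothetical and are not taken from the released VLMo code at aka.ms/vlmo.

```python
from dataclasses import dataclass

# Hedged sketch: the base/large configurations quoted in the "Experiment Setup"
# row, collected into a hypothetical config object (names are illustrative).
@dataclass
class VLMoConfig:
    num_layers: int            # Transformer blocks
    hidden_size: int
    num_heads: int
    num_vl_expert_layers: int  # top layers that use a vision-language expert
    image_size: int = 224      # pre-training input resolution (224x224)
    patch_size: int = 16       # 16x16 image patches
    max_text_len: int = 40     # maximum text sequence length

VLMO_BASE = VLMoConfig(num_layers=12, hidden_size=768, num_heads=12,
                       num_vl_expert_layers=2)   # ~175M parameters
VLMO_LARGE = VLMoConfig(num_layers=24, hidden_size=1024, num_heads=16,
                        num_vl_expert_layers=3)  # ~562M parameters
```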
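
Likewise, a minimal PyTorch sketch of the quoted optimization recipe: AdamW with β1 = 0.9, β2 = 0.98, weight decay 0.01, a peak learning rate of 2e-4 (base) or 5e-5 (large), linear warmup over 2.5k steps, and linear decay over 200k total steps. The placeholder model and the exact schedule implementation are assumptions, not the authors' code; the batch-size comment restates the arithmetic from the Hardware Specification row (32 GPUs × 32 samples per GPU = 1024).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 200_000   # pre-training steps
WARMUP_STEPS = 2_500    # linear warmup steps
PEAK_LR = 2e-4          # base-size model; 5e-5 for the large-size model

# Effective batch size implied by the Hardware Specification row:
# 32 V100 GPUs x 32 samples per GPU = 1024 samples per optimization step.

model = torch.nn.Linear(768, 768)  # stand-in for the VLMo network

optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

# Call scheduler.step() once per training step.
scheduler = LambdaLR(optimizer, lr_lambda)
```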