VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Authors: Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, Furu Wei
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that VLMO achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. |
| Researcher Affiliation | Collaboration | Hangbo Bao1, Wenhui Wang2, Li Dong2, Qiang Liu2, Owais Khan Mohammed2, Kriti Aggarwal2, Subhojit Som2, Songhao Piao1, Furu Wei2; 1Harbin Institute of Technology, 2Microsoft Corporation |
| Pseudocode | No | The paper does not include any explicit pseudocode or algorithm blocks. It uses diagrams and descriptive text to explain the model architecture. |
| Open Source Code | Yes | The code and pretrained models are available at http://aka.ms/vlmo. |
| Open Datasets | Yes | Following previous work [4, 21], our pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22] datasets. |
| Dataset Splits | Yes | We report vqa-score on VQA test-dev and test-standard split, and report accuracy for NLVR2 development and public test set (test-P). |
| Hardware Specification | Yes | We perform experiments using 32 V100 GPUs for the base-size model. The batch size per GPU is 32, and the total batch size is 1024. |
| Software Dependencies | No | The paper mentions software components like BERT's tokenizer and the AdamW optimizer, but does not provide specific version numbers for any software dependencies such as PyTorch, TensorFlow, or Python libraries. |
| Experiment Setup | Yes | Our models adopt the same network configuration as ViT [13] and BEiT [3]. VLMO-Base consists of 12-layer Transformer blocks with 768 hidden size and 12 attention heads. VLMO-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads. VLMO-Base uses a vision-language expert on the top two Transformer layers, and VLMO-Large introduces the vision-language expert on the top three layers. VLMO-Base consists of 175M parameters and VLMO-Large contains 562M parameters. For images, the input resolution is 224×224 and the patch size is 16×16 during pre-training. We apply RandAugment [10] to the input images. The tokenizer of the uncased version of BERT is employed to tokenize the text. The maximum text sequence length is set to 40. We also employ whole word masking for the masked language modeling pre-training task. We pretrain the models for 200k steps with 1024 batch size. We utilize the AdamW [30] optimizer with β1 = 0.9, β2 = 0.98. The peak learning rate is 2e-4 for the base-size model, 5e-5 for the large-size model. Weight decay is set to 0.01. We use linear warmup over the first 2.5k steps and linear decay. (Hedged code sketches of the MoME block structure and the optimizer schedule follow this table.) |
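
To make the quoted architecture description concrete, below is a minimal sketch of a Mixture-of-Modality-Experts (MoME) Transformer block: self-attention is shared across modalities, while the feed-forward network is switched among modality-specific experts, with a vision-language expert present only in the top layers. The class name `MoMEBlock`, the `modality` argument, and the dimensions are illustrative assumptions; this is not the authors' released implementation (available at http://aka.ms/vlmo).

```python
# Hedged sketch of a MoME Transformer block: shared self-attention,
# modality-specific feed-forward experts. Names and sizes are assumptions.
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4, use_vl_expert=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Self-attention parameters are shared by all modalities.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

        experts = {"vision": ffn(), "language": ffn()}
        if use_vl_expert:  # vision-language expert only in the top layers
            experts["vision_language"] = ffn()
        self.experts = nn.ModuleDict(experts)

    def forward(self, x, modality="vision"):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route the FFN through the expert matching the input modality.
        x = x + self.experts[modality](self.norm2(x))
        return x

block = MoMEBlock(use_vl_expert=True)
tokens = torch.randn(2, 40, 768)          # e.g., text sequences of length 40
out = block(tokens, modality="language")  # shape: (2, 40, 768)
```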
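
The pre-training recipe quoted in the Experiment Setup row (AdamW with β1 = 0.9, β2 = 0.98, peak learning rate 2e-4 for the base model, weight decay 0.01, linear warmup over the first 2.5k of 200k steps, then linear decay) can be reconstructed roughly as follows. This is a hedged sketch in plain PyTorch with a stand-in module; the paper does not provide this code.

```python
# Rough reconstruction of the reported pre-training optimizer and LR schedule.
# The tiny Linear module is a placeholder for the actual VLMo network.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 200_000   # "200k steps with 1024 batch size"
WARMUP_STEPS = 2_500    # "linear warmup over the first 2.5k steps"
PEAK_LR = 2e-4          # base-size model; 5e-5 for the large-size model

model = torch.nn.Linear(768, 768)  # stand-in for the real model

optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
# In the training loop, call optimizer.step() and scheduler.step() once per step.
```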