EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

Authors: Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang

AAAI 2024 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of EVE on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval. EVE achieves state-of-the-art performance on Image-Text Retrieval and Vision-Language Understanding (VQA and NLVR2) tasks. Our contributions are summarized as follows: We introduce EVE, an efficient vision-language foundation model that achieves state-of-the-art performance while improving training speed, with one unified multimodal Transformer and one unified pre-training task. We integrate Modality-Aware MoE with a shared multimodal Transformer to achieve a more profound fusion of different modalities and capture more modality-specific information simultaneously, resulting in better performance and faster inference speed within a unified architecture. We propose a unified masked signal modeling technique, simplifying vision-language pre-training into a single unified objective, resulting in significantly improved pre-training speed and competitive performance. (A minimal sketch of this modality-aware routing appears after the table.)
Researcher Affiliation | Collaboration | (1) Sun Yat-sen University, (2) Institute of Automation, Chinese Academy of Sciences (CASIA), (3) Bytedance Inc.
Pseudocode | No | The paper describes the model architecture and training process in detail, including figures, but does not provide any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Following previous methods, we pre-train EVE on four widely used public datasets: MSCOCO Captions (Lin et al. 2014), Visual Genome (Krishna et al. 2017), SBU Captions (Ordonez, Kulkarni, and Berg 2011) and Conceptual Captions (Sharma et al. 2018).
Dataset Splits | Yes | We evaluate our model on the VQA2.0 dataset (Goyal et al. 2017)... We evaluate our model on the NLVR2 dataset (Suhr et al. 2019)... Table 3 reports, per MIM target, results on NLVR2 (dev, test-P), Flickr30K (TR, IR), and VQA.
Hardware Specification | Yes | Training hours of all models are reproduced by us on A100 GPUs.
Software Dependencies | No | The paper mentions using the 'AdamW (Loshchilov and Hutter 2019) optimizer' but does not specify version numbers for any software dependencies such as programming languages, deep learning frameworks, or libraries.
Experiment Setup | Yes | EVE-Base has 12 Transformer blocks and EVE-Large has 24 Transformer blocks. We employ a soft router with 32 experts on the top 2 blocks of EVE-Base and the top 3 blocks of EVE-Large, and a hard router on the other blocks. We pre-train EVE-Base for 480k steps with a batch size of 2048 and EVE-Large with the same batch size for 280k steps. We use the AdamW (Loshchilov and Hutter 2019) optimizer. The peak learning rate is 5e-4 for EVE-Base and 2e-4 for EVE-Large. During pre-training, the image resolution is 224×224. We use random resized cropping and horizontal flipping for data augmentation. We mask 75% of the image in MIM and 50% of the text in MLM.
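
The Experiment Setup row above lists the reported pre-training hyperparameters. Below is a hedged transcription into a small Python config plus a toy random-masking helper; the dataclass fields, the `random_mask` helper, and the 16x16 patch size in the usage line are our assumptions, and only the numeric values come from the row.

```python
# Hedged transcription of the reported pre-training setup. Field names and
# the masking helper are illustrative, not taken from the paper's (unreleased) code.
from dataclasses import dataclass

import torch


@dataclass
class EVEPretrainConfig:
    blocks: int                      # Transformer depth
    moe_blocks: int                  # top blocks that use the soft router
    num_experts: int = 32            # soft-router experts
    steps: int = 0                   # pre-training steps
    batch_size: int = 2048
    peak_lr: float = 5e-4            # AdamW peak learning rate
    image_size: int = 224
    image_mask_ratio: float = 0.75   # MIM: mask 75% of the image
    text_mask_ratio: float = 0.50    # MLM: mask 50% of the text


EVE_BASE = EVEPretrainConfig(blocks=12, moe_blocks=2, steps=480_000, peak_lr=5e-4)
EVE_LARGE = EVEPretrainConfig(blocks=24, moe_blocks=3, steps=280_000, peak_lr=2e-4)


def random_mask(num_tokens: int, ratio: float) -> torch.Tensor:
    """Boolean mask selecting `ratio` of the tokens uniformly at random,
    e.g. 75% of image patches for MIM or 50% of text tokens for MLM."""
    num_masked = int(num_tokens * ratio)
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask


# Example: mask 75% of the patches of a 224x224 image, assuming 16x16 patches
# (a common ViT choice; the patch size is not stated in the row above).
patch_mask = random_mask((224 // 16) ** 2, EVE_BASE.image_mask_ratio)
```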
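
The Research Type row describes EVE's Modality-Aware MoE: a hard router that dispatches each token to a per-modality expert on most blocks, and a learned soft router with 32 experts on the top blocks. As a point of reference only, here is a minimal PyTorch sketch of such a layer under those assumptions; the module name `ModalityAwareFFN`, the `modality_ids` argument, and the top-k gating details are ours, not the authors' implementation (no code is released per the Open Source Code row).

```python
# Minimal sketch of a modality-aware mixture-of-experts FFN (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_expert(dim: int, hidden: int) -> nn.Module:
    """One feed-forward expert with the usual Transformer FFN shape."""
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class ModalityAwareFFN(nn.Module):
    """Drop-in replacement for the FFN of a shared multimodal Transformer block.

    hard routing (soft=False): each token goes to the expert of its modality
        (0 = vision patch, 1 = text token), as on the lower blocks.
    soft routing (soft=True): a learned top-k gate over `num_experts` experts,
        as on the top blocks.
    """

    def __init__(self, dim=768, hidden=3072, num_experts=32, top_k=2, soft=False):
        super().__init__()
        self.soft = soft
        if soft:
            self.experts = nn.ModuleList([make_expert(dim, hidden) for _ in range(num_experts)])
            self.gate = nn.Linear(dim, num_experts)
            self.top_k = top_k
        else:
            # one expert per modality: vision / text
            self.experts = nn.ModuleList([make_expert(dim, hidden) for _ in range(2)])

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (batch, seq) with 0 = vision, 1 = text
        out = torch.zeros_like(x)
        if not self.soft:
            # hard routing by modality
            for m, expert in enumerate(self.experts):
                sel = modality_ids == m
                if sel.any():
                    out[sel] = expert(x[sel])
            return out
        # soft routing: send each token to its top-k experts, weighted by the gate
        scores = F.softmax(self.gate(x), dim=-1)               # (B, S, E)
        weights, idx = scores.topk(self.top_k, dim=-1)         # (B, S, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                   # (B, S, k) bool
            if hit.any():
                token_sel = hit.any(dim=-1)                    # (B, S) tokens using expert e
                w = (weights * hit.to(weights.dtype)).sum(dim=-1)[token_sel]
                out[token_sel] += w.unsqueeze(-1) * expert(x[token_sel])
        return out


# Toy usage: a sequence of 4 image patches followed by 4 text tokens.
x = torch.randn(2, 8, 768)
modality_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]] * 2)
hard_ffn = ModalityAwareFFN(soft=False)   # lower blocks
soft_ffn = ModalityAwareFFN(soft=True)    # top blocks
y = soft_ffn(hard_ffn(x, modality_ids), modality_ids)  # (2, 8, 768)
```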