mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Authors: Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. |
| Researcher Affiliation | Industry | 1DAMO Academy, Alibaba Group, China. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 2 is a diagram illustrating the framework and modules, not a procedural algorithm. |
| Open Source Code | Yes | Code and models will be released in https://github.com/X-PLUG/mPLUG2. |
| Open Datasets | Yes | Following previous works (Li et al., 2021b; 2022a), we pre-train our model with the same popular image-text datasets with 14M images including MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), Conceptual Captions 3M (Sharma et al., 2018), Conceptual Captions 12M (Changpinyo et al., 2021), and SBU Captions (Ordonez et al., 2011). For video-text datasets, we adopt the web-sourced video dataset WebVid-2M (Bain et al., 2021a) with 2.5M video-text pairs. The text datasets consist of Wiki Corpus (Devlin et al., 2018) (about 20GB) and cleaned common crawl (about 350GB). |
| Dataset Splits | Yes | COCO has over 330k images, and 5 independent human-generated captions are provided for each image. It contains 113k/5k/5k images for training/validation/testing. Flickr30K contains 31k images from Flickr, each image with 5 human-annotated sentences. It contains 29k/1k/1k images for training/validation/testing. |
| Hardware Specification | Yes | We pre-train the model for 30 epochs with the total batch size of 1024 on 8 NVIDIA A100 GPUs for mPLUG-2Base and batch size of 512 on 16 NVIDIA A100 GPUs for mPLUG-2. |
| Software Dependencies | No | The paper mentions 'Our models are implemented in the PyTorch framework (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or other key software libraries. |
| Experiment Setup | Yes | We pre-train the model for 30 epochs with the total batch size of 1024 on 8 NVIDIA A100 GPUs for mPLUG-2Base and batch size of 512 on 16 NVIDIA A100 GPUs for mPLUG-2. We use AdamW (Loshchilov & Hutter, 2019) optimizer with the weight decay factor 0.02 and betas (0.9, 0.98) for stabilizing the learning. The learning rate is first warmed up to lrmax in the first 5000 iterations and then decays following the cosine annealing schedule. lrmax is set to 1e-4 for mPLUG-2Base and 5e-5 for mPLUG-2. During the pre-training, we randomly crop the images and video frames into 224x224 resolution and sparsely sample 4 frames for each video while preserving their order in-between. For vision-text contrastive learning, the queue size and the momentum coefficient are set to 65,536 and 0.995 respectively. |
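
For readers reproducing the quoted optimization recipe, below is a minimal PyTorch sketch of the setup described in the Experiment Setup row: AdamW with weight decay 0.02 and betas (0.9, 0.98), a 5000-iteration linear warmup to lrmax, followed by cosine annealing. The tiny stand-in model, `total_steps`, and the demonstration loop are illustrative placeholders rather than details from the paper; only the optimizer and schedule hyper-parameters come from the quoted text.

```python
import math
import torch

# Sketch of the quoted pre-training optimization setup (not the authors' code).
# Only the optimizer/schedule hyper-parameters are taken from the paper; the
# stand-in model, total_steps, and the loop below are hypothetical.

model = torch.nn.Linear(768, 768)   # placeholder for the actual mPLUG-2 network
lr_max = 1e-4                       # 1e-4 for mPLUG-2Base, 5e-5 for mPLUG-2
warmup_steps = 5000                 # linear warmup length from the paper
total_steps = 100_000               # hypothetical; depends on dataset size and epochs

optimizer = torch.optim.AdamW(
    model.parameters(), lr=lr_max, weight_decay=0.02, betas=(0.9, 0.98)
)

def lr_lambda(step: int) -> float:
    """Linear warmup for the first 5000 iterations, then cosine annealing to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inspect the resulting learning-rate curve at a few points.
for step in range(total_steps):
    if step in (0, warmup_steps, total_steps // 2, total_steps - 1):
        print(f"step {step:>6d}: lr = {scheduler.get_last_lr()[0]:.2e}")
    optimizer.step()      # forward/backward on image-text or video-text batches omitted
    scheduler.step()
```

The queue size (65,536) and momentum coefficient (0.995) quoted above refer to the momentum-encoder and feature-queue mechanism used for vision-text contrastive learning; that machinery is omitted from this sketch for brevity.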