Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Authors: Yan Zeng, Xinsong Zhang, Hang Li

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that X-VLM effectively leverages the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
Researcher Affiliation | Industry | ByteDance AI Lab.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are available at https://github.com/zengyan-97/X-VLM.
Open Datasets | Yes | COCO (Lin et al., 2014) and Visual Genome (VG) (Krishna et al., 2017), and two out-of-domain datasets, SBU Captions (Ordonez et al., 2011) and Conceptual Captions (CC) (Sharma et al., 2018). ... In the 16M setting, we exploit a much noisier Conceptual 12M dataset (CC-12M) (Changpinyo et al., 2021) following ALBEF (Li et al., 2021a). We additionally exploit Objects365 (Shao et al., 2019) and Open Images (Kuznetsova et al., 2018) following VinVL (Zhang et al., 2021).
Dataset Splits | Yes | We adopt the widely used Karpathy split (Karpathy & Li, 2015) for both datasets. ... Following the previous work (Cho et al., 2021; Li et al., 2021a), we use both train and validation sets for training, and include additional question-answer pairs from Visual Genome.
Hardware Specification | Yes | In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days.
Software Dependencies | No | The paper mentions software components like 'Python', 'BERT-base', and 'Swin Transformer-base' but does not specify their version numbers or other specific software dependencies required for reproducibility (see the backbone-loading sketch below the table).
Experiment Setup | Yes | We apply mixed precision for pre-training. In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days. In the 16M setting, we train the model on 24 GPUs with a batch size of 3072. We sample the data by making half of the images in a batch contain bounding box annotations. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-4 from 1e-5 in the first 2500 steps and decayed to 1e-5 following a linear schedule (see the training-recipe sketch below the table).
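
The Software Dependencies row names a BERT-base text encoder and a Swin Transformer-base vision encoder without pinning versions. The sketch below shows one plausible way to instantiate those two backbones in Python; the use of the transformers and timm libraries and the specific checkpoint names are assumptions for illustration, not how the X-VLM repository necessarily loads its weights.

```python
# Minimal sketch (not the authors' code): instantiating the two backbones named
# in the paper, BERT-base for text and Swin Transformer-base for vision.
# Library and checkpoint choices here are assumptions; the official X-VLM
# repository may load and adapt these weights differently.
import timm
from transformers import BertModel

text_encoder = BertModel.from_pretrained("bert-base-uncased")  # 12-layer BERT-base

vision_encoder = timm.create_model(
    "swin_base_patch4_window7_224",  # assumed Swin-base variant at 224x224 input
    pretrained=True,
    num_classes=0,  # drop the classification head, keep patch features
)
```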
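The Experiment Setup row fully specifies the optimization recipe: AdamW with weight decay 0.02, a learning rate warmed up from 1e-5 to 1e-4 over the first 2,500 steps and decayed linearly back to 1e-5 over 200K total steps, with mixed-precision training. A minimal PyTorch sketch of that recipe follows; the model, batch, and loss are placeholders standing in for the X-VLM architecture and its pre-training objectives.

```python
# Minimal PyTorch sketch of the reported recipe (not the authors' code):
# AdamW (weight decay 0.02), linear warm-up 1e-5 -> 1e-4 over 2,500 steps,
# linear decay to 1e-5 by step 200,000, and mixed-precision updates.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS, WARMUP_STEPS = 200_000, 2_500
PEAK_LR, FLOOR_LR = 1e-4, 1e-5

model = torch.nn.Linear(768, 768)  # placeholder for the X-VLM model
optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.02)

def lr_lambda(step: int) -> float:
    """Scale factor relative to PEAK_LR: warm up, then decay linearly to FLOOR_LR."""
    floor = FLOOR_LR / PEAK_LR
    if step < WARMUP_STEPS:
        return floor + (1.0 - floor) * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 1.0 - (1.0 - floor) * min(progress, 1.0)

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # mixed-precision loss scaling

def train_step(batch: torch.Tensor) -> None:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(batch).pow(2).mean()  # stand-in for the pre-training losses
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

train_step(torch.randn(4, 768))  # dummy batch to exercise one update
```

The per-step update is the part the paper pins down; batch composition (half of the images carrying bounding-box annotations) and distributed training across 8 or 24 GPUs are omitted from the sketch.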