Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Authors: Yan Zeng, Xinsong Zhang, Hang Li
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that X-VLM effectively leverages the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods. |
| Researcher Affiliation | Industry | ByteDance AI Lab. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/zengyan-97/X-VLM. |
| Open Datasets | Yes | COCO (Lin et al., 2014) and Visual Genome (VG) (Krishna et al., 2017), and two out-of-domain datasets, SBU Captions (Ordonez et al., 2011) and Conceptual Captions (CC) (Sharma et al., 2018). ... In the 16M setting, we exploit a much noisier Conceptual 12M dataset (CC-12M) (Changpinyo et al., 2021) following ALBEF (Li et al., 2021a). We additionally exploit Objects365 (Shao et al., 2019) and Open Images (Kuznetsova et al., 2018) following VinVL (Zhang et al., 2021). |
| Dataset Splits | Yes | We adopt the widely used Karpathy split (Karpathy & Li, 2015) for both datasets. ... Following the previous work (Cho et al., 2021; Li et al., 2021a), we use both train and validation sets for training, and include additional question-answer pairs from Visual Genome. |
| Hardware Specification | Yes | In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days. |
| Software Dependencies | No | The paper mentions software components such as Python, BERT-base, and Swin Transformer-base, but does not specify their version numbers or other software dependencies required for reproducibility. |
| Experiment Setup | Yes | We apply mixed precision for pre-training. In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days. In the 16M setting, we train the model on 24 GPUs with a batch size of 3072. We sample the data by making half of the images in a batch contain bounding box annotations. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-4 from 1e-5 in the first 2500 steps and decayed to 1e-5 following a linear schedule. (A minimal sketch of this optimizer and schedule is shown below the table.) |
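
For concreteness, the reported 4M-setting optimization hyperparameters (AdamW with weight decay 0.02, warm-up from 1e-5 to 1e-4 over the first 2,500 steps, linear decay back to 1e-5 by step 200K, mixed precision) can be written as a short PyTorch sketch. This is an illustration of the quoted settings only, not the authors' released training code; `build_xvlm`, `pretraining_loader`, and `vl_pretraining_loss` are hypothetical placeholders.

```python
# Minimal sketch of the reported pre-training schedule (4M setting), assuming
# a standard PyTorch training loop. Model, data loader, and loss are
# hypothetical placeholders, not taken from the released X-VLM code.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR, FLOOR_LR = 1e-4, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 2_500, 200_000
WEIGHT_DECAY = 0.02

def lr_factor(step: int) -> float:
    """Multiplier applied to PEAK_LR at a given optimizer step."""
    floor = FLOOR_LR / PEAK_LR  # 0.1
    if step < WARMUP_STEPS:
        # Linear warm-up: 1e-5 -> 1e-4 over the first 2,500 steps.
        return floor + (1.0 - floor) * step / WARMUP_STEPS
    # Linear decay: 1e-4 -> 1e-5 by step 200K.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return 1.0 - (1.0 - floor) * progress

model = build_xvlm().cuda()                       # hypothetical model constructor
optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)
scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)
scaler = torch.cuda.amp.GradScaler()              # mixed-precision loss scaling

for step, batch in enumerate(pretraining_loader): # hypothetical data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = vl_pretraining_loss(model, batch)  # hypothetical loss function
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    if step + 1 == TOTAL_STEPS:
        break
```

Note that the paper's batch-size and multi-GPU details (1024 on 8 A100s in the 4M setting, 3072 on 24 GPUs in the 16M setting, and the half-with-bounding-boxes sampling) are omitted here; the sketch only reproduces the optimizer and learning-rate schedule as quoted.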