Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Authors: Yan Zeng, Xinsong Zhang, Hang Li
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that X-VLM effectively leverages the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods. |
| Researcher Affiliation | Industry | ByteDance AI Lab. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/zengyan-97/X-VLM. |
| Open Datasets | Yes | COCO (Lin et al., 2014) and Visual Genome (VG) (Krishna et al., 2017), and two out-of-domain datasets, SBU Captions (Ordonez et al., 2011) and Conceptual Captions (CC) (Sharma et al., 2018). ... In the 16M setting, we exploit a much noisier Conceptual 12M dataset (CC-12M) (Changpinyo et al., 2021) following ALBEF (Li et al., 2021a). We additionally exploit Objects365 (Shao et al., 2019) and Open Images (Kuznetsova et al., 2018) following VinVL (Zhang et al., 2021). |
| Dataset Splits | Yes | We adopt the widely used Karpathy split (Karpathy & Li, 2015) for both datasets. ... Following the previous work (Cho et al., 2021; Li et al., 2021a), we use both train and validation sets for training, and include additional question-answer pairs from Visual Genome. |
| Hardware Specification | Yes | In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days. |
| Software Dependencies | No | The paper mentions software components like 'Python', 'BERT-base', and 'Swin Transformer-base' but does not specify their version numbers or other specific software dependencies required for reproducibility. |
| Experiment Setup | Yes | We apply mixed precision for pre-training. In the 4M setting, we train the model for 200K steps on 8 NVIDIA A100 GPUs and the batch size is set to 1024, which takes 3.5 days. In the 16M setting, we train the model on 24 GPUs with a batch size of 3072. We sample the data by making half of the images in a batch containing bounding box annotations. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.02. The learning rate is warmed up to 1e-4 from 1e-5 in the first 2500 steps and decayed to 1e-5 following a linear schedule. |
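The learning-rate schedule quoted in the Experiment Setup row (linear warm-up from 1e-5 to 1e-4 over the first 2500 steps, then linear decay back to 1e-5 over the remaining 200K steps) can be sketched as a plain function. The paper does not publish this code; the function name and per-step linear interpolation are assumptions for illustration.

```python
def lr_at_step(step, total_steps=200_000, warmup_steps=2500,
               init_lr=1e-5, peak_lr=1e-4, final_lr=1e-5):
    """Linear warm-up from init_lr to peak_lr, then linear decay to final_lr.

    Hypothetical sketch of the schedule described in the paper's
    Experiment Setup; the exact interpolation granularity is assumed.
    """
    if step < warmup_steps:
        # Warm-up phase: interpolate init_lr -> peak_lr over warmup_steps.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Decay phase: interpolate peak_lr -> final_lr over the remaining steps.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * frac
```

In a framework such as PyTorch, a function like this would typically be wrapped in a per-step scheduler (e.g. a lambda-based LR scheduler) rather than called directly.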