BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments and analysis, and make the following key observations. BLIP achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog. |
| Researcher Affiliation | Industry | 1Salesforce Research. Correspondence to: Junnan Li <junnan.li@salesforce.com>. |
| Pseudocode | No | The paper describes methods and architectures but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/salesforce/BLIP. |
| Open Datasets | Yes | We use the same pre-training dataset as Li et al. (2021a) with 14M images in total, including two human-annotated datasets (COCO and Visual Genome (Krishna et al., 2017)), and three web datasets (Conceptual Captions (Changpinyo et al., 2021), Conceptual 12M (Changpinyo et al., 2021), SBU captions (Ordonez et al., 2011)). |
| Dataset Splits | Yes | We use the Karpathy split (Karpathy & Li, 2015) for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test. |
| Hardware Specification | No | Our models are implemented in PyTorch (Paszke et al., 2019) and pre-trained on two 16-GPU nodes. |
| Software Dependencies | No | Our models are implemented in PyTorch (Paszke et al., 2019)... |
| Experiment Setup | Yes | We pre-train the model for 20 epochs using a batch size of 2880 (ViT-B) / 2400 (ViT-L). We use AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) and decayed linearly with a rate of 0.85. We take random image crops of resolution 224×224 during pre-training, and increase the image resolution to 384×384 during finetuning. Table 14 shows finetuning hyperparameters for downstream tasks (e.g., Retrieval: init LR 1e-5 (5e-6), batch size 256, #epoch 6). A hedged configuration sketch based on these values follows the table. |
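
The Experiment Setup row above pins down the optimizer, weight decay, peak learning rate, and decay rate for pre-training. The following is a minimal PyTorch sketch of how those reported values could be wired together. The stand-in `model`, the warmup length, the steps-per-epoch value, and the reading of "decayed linearly with a rate of 0.85" as a per-epoch multiplicative factor are assumptions for illustration, not details taken from the paper or its released code.

```python
import torch

# Stand-in module for illustration; in practice this would be the BLIP model.
model = torch.nn.Linear(768, 768)

# AdamW with weight decay 0.05 and peak LR 3e-4, as quoted for the ViT-B setting
# (the excerpt gives 2e-4 for ViT-L).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

# Assumed values: the excerpt does not state the warmup length or steps per epoch.
warmup_steps = 3000
steps_per_epoch = 5000

def lr_lambda(step: int) -> float:
    """Scale factor applied to the peak LR at each optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak LR.
        return step / max(1, warmup_steps)
    # One possible reading of "decayed ... with a rate of 0.85":
    # multiply the LR by 0.85 after each epoch.
    epoch = step // steps_per_epoch
    return 0.85 ** epoch

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step: optimizer.step() followed by scheduler.step().
```

The same scaffold would apply to the finetuning settings quoted from Table 14 (e.g., retrieval with an initial LR of 1e-5 and batch size 256) by swapping in the task-specific values.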