BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments and analysis, and make the following key observations. BLIP achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog.
Researcher Affiliation | Industry | Salesforce Research. Correspondence to: Junnan Li <junnan.li@salesforce.com>.
Pseudocode | No | The paper describes methods and architectures but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/salesforce/BLIP. (See the captioning sketch after this table.)
Open Datasets | Yes | We use the same pre-training dataset as Li et al. (2021a) with 14M images in total, including two human-annotated datasets (COCO and Visual Genome (Krishna et al., 2017)), and three web datasets (Conceptual Captions (Changpinyo et al., 2021), Conceptual 12M (Changpinyo et al., 2021), SBU captions (Ordonez et al., 2011)).
Dataset Splits | Yes | We use the Karpathy split (Karpathy & Li, 2015) for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test. (See the split-loading sketch after this table.)
Hardware Specification | No | Our models are implemented in PyTorch (Paszke et al., 2019) and pre-trained on two 16-GPU nodes.
Software Dependencies | No | Our models are implemented in PyTorch (Paszke et al., 2019)...
Experiment Setup | Yes | We pre-train the model for 20 epochs using a batch size of 2880 (ViT-B) / 2400 (ViT-L). We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) and decayed linearly with a rate of 0.85. We take random image crops of resolution 224×224 during pre-training, and increase the image resolution to 384×384 during finetuning. Table 14 shows finetuning hyperparameters for downstream tasks (e.g., Retrieval: init LR 1e-5 (5e-6), batch size 256, #epoch 6). (See the optimizer/schedule sketch after this table.)
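
The official implementation is the repository quoted in the Open Source Code row. As a quick illustration of running a released BLIP captioning checkpoint, the sketch below uses the Hugging Face transformers port of BLIP rather than the official repository; the port, the checkpoint name, and the example image URL are assumptions of this write-up, not something described in the paper.

    # Minimal captioning sketch using the Hugging Face `transformers` port of BLIP
    # (assumed available); the official code lives at https://github.com/salesforce/BLIP.
    import requests
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(processor.decode(out[0], skip_special_tokens=True))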
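
For the Karpathy split quoted in the Dataset Splits row, the sketch below shows one way to group COCO images by split, assuming the commonly distributed dataset_coco.json annotation file (the local path is hypothetical). The "restval" images are conventionally folded into the training set, which is how the ~113k train count arises.

    # Minimal sketch: group COCO images by Karpathy split, assuming the commonly
    # distributed `dataset_coco.json` file (the path below is a placeholder).
    import json
    from collections import defaultdict

    with open("annotations/dataset_coco.json") as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        split = img["split"]
        if split == "restval":   # 'restval' is conventionally merged into train
            split = "train"
        splits[split].append(img["filename"])

    print({k: len(v) for k, v in splits.items()})  # expect roughly 113k/5k/5k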
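
The pre-training recipe in the Experiment Setup row maps onto a standard PyTorch training loop. The sketch below is an illustrative reading, not the authors' code: it assumes "decayed linearly with a rate of 0.85" means a per-epoch multiplicative factor of 0.85 after a warm-up, and the model, data loop, and warm-up length are placeholders.

    # Illustrative sketch of the quoted recipe: AdamW with weight decay 0.05,
    # warm-up to a peak LR of 3e-4 (ViT-B), then a per-epoch decay factor of 0.85.
    # `model`, the warm-up length, and the training loop are hypothetical placeholders.
    import torch

    model = torch.nn.Linear(10, 10)  # placeholder for the BLIP model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    peak_lr, warmup_epochs, num_epochs, decay_rate = 3e-4, 1, 20, 0.85

    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            lr = peak_lr * (epoch + 1) / warmup_epochs            # simple warm-up (assumed)
        else:
            lr = peak_lr * decay_rate ** (epoch - warmup_epochs)  # one reading of "rate of 0.85"
        for g in optimizer.param_groups:
            g["lr"] = lr
        # ... one pre-training epoch over the 14M-image corpus would run here ...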