BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments and analysis, and make the following key observations. BLIP achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog.
Researcher Affiliation | Industry | Salesforce Research. Correspondence to: Junnan Li <junnan.li@salesforce.com>.
Pseudocode | No | The paper describes methods and architectures but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/salesforce/BLIP. (See the captioning sketch after this table.)
Open Datasets | Yes | We use the same pre-training dataset as Li et al. (2021a) with 14M images in total, including two human-annotated datasets (COCO and Visual Genome (Krishna et al., 2017)), and three web datasets (Conceptual Captions (Changpinyo et al., 2021), Conceptual 12M (Changpinyo et al., 2021), SBU captions (Ordonez et al., 2011)).
Dataset Splits | Yes | We use the Karpathy split (Karpathy & Li, 2015) for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test. (See the split-loading sketch after this table.)
Hardware Specification | No | Our models are implemented in PyTorch (Paszke et al., 2019) and pre-trained on two 16-GPU nodes.
Software Dependencies | No | Our models are implemented in PyTorch (Paszke et al., 2019)...
Experiment Setup | Yes | We pre-train the model for 20 epochs using a batch size of 2880 (ViT-B) / 2400 (ViT-L). We use the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) and decayed linearly with a rate of 0.85. We take random image crops of resolution 224×224 during pre-training, and increase the image resolution to 384×384 during finetuning. Table 14 shows finetuning hyperparameters for downstream tasks (e.g., Retrieval: init LR 1e-5 (5e-6), batch size 256, #epoch 6). (See the optimizer/schedule sketch after this table.)
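
The official implementation is the repository quoted in the Open Source Code row. As a quick illustration of running a released BLIP captioning checkpoint, the sketch below uses the Hugging Face transformers port of BLIP rather than the official repository; the port, the checkpoint name, and the example image URL are assumptions of this write-up, not something described in the paper.

    # Minimal captioning sketch using the Hugging Face `transformers` port of BLIP
    # (assumed available); the official code lives at https://github.com/salesforce/BLIP.
    import requests
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(processor.decode(out[0], skip_special_tokens=True))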
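
For the Karpathy split quoted in the Dataset Splits row, the sketch below shows one way to group COCO images by split, assuming the commonly distributed dataset_coco.json annotation file (the local path is hypothetical). The "restval" images are conventionally folded into the training set, which is how the ~113k train count arises.

    # Minimal sketch: group COCO images by Karpathy split, assuming the commonly
    # distributed `dataset_coco.json` file (the path below is a placeholder).
    import json
    from collections import defaultdict

    with open("annotations/dataset_coco.json") as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        split = img["split"]
        if split == "restval":   # 'restval' is conventionally merged into train
            split = "train"
        splits[split].append(img["filename"])

    print({k: len(v) for k, v in splits.items()})  # expect roughly 113k/5k/5k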
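
The pre-training recipe in the Experiment Setup row maps onto a standard PyTorch training loop. The sketch below is an illustrative reading, not the authors' code: it assumes "decayed linearly with a rate of 0.85" means a per-epoch multiplicative factor of 0.85 after a warm-up, and the model, data loop, and warm-up length are placeholders.

    # Illustrative sketch of the quoted recipe: AdamW with weight decay 0.05,
    # warm-up to a peak LR of 3e-4 (ViT-B), then a per-epoch decay factor of 0.85.
    # `model`, the warm-up length, and the training loop are hypothetical placeholders.
    import torch

    model = torch.nn.Linear(10, 10)  # placeholder for the BLIP model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    peak_lr, warmup_epochs, num_epochs, decay_rate = 3e-4, 1, 20, 0.85

    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            lr = peak_lr * (epoch + 1) / warmup_epochs            # simple warm-up (assumed)
        else:
            lr = peak_lr * decay_rate ** (epoch - warmup_epochs)  # one reading of "rate of 0.85"
        for g in optimizer.param_groups:
            g["lr"] = lr
        # ... one pre-training epoch over the 14M-image corpus would run here ...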