BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's capabilities of zero-shot image-to-text generation that can follow natural language instructions. Section 4. Experiment: Table 1 provides an overview of the performance of BLIP-2 on various zero-shot vision-language tasks. Compared with previous state-of-the-art models, BLIP-2 achieves the highest zero-shot performance while requiring the least number of trainable parameters during vision-language pre-training.
Researcher Affiliation | Industry | Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi (Salesforce Research).
Pseudocode | No | No structured pseudocode or algorithm blocks (e.g., labeled "Pseudocode" or "Algorithm") were found in the paper.
Open Source Code | Yes | https://github.com/salesforce/LAVIS/tree/main/projects/blip2 (a minimal loading sketch follows this table).
Open Datasets | Yes | We use the same pre-training dataset as BLIP with 129M images in total, including COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and 115M images from the LAION400M dataset (Schuhmann et al., 2021).
Dataset Splits | Yes | Table 3. Comparison with state-of-the-art image captioning methods on NoCaps and COCO Caption. NoCaps zero-shot (validation set). Table 2. Comparison with state-of-the-art methods on zero-shot visual question answering. VQAv2 val. Section 4.3 Visual Question Answering: Following BLIP, our VQA data includes the training and validation splits from VQAv2, as well as training samples from Visual Genome.
Hardware Specification | Yes | For example, using a single 16-A100 (40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage.
Software Dependencies | No | Specific software dependencies with version numbers were not provided. The paper mentions using the AdamW optimizer and pre-trained models such as BERT-base, OPT, and FlanT5, but not the software environment or libraries with their versions (e.g., PyTorch 1.x, Python 3.x).
Experiment Setup | Yes | We pre-train for 250k steps in the first stage and 80k steps in the second stage. We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9, β2 = 0.98, and a weight decay of 0.05. We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps. The minimum learning rate at the second stage is 5e-5. We use images of size 224×224, augmented with random resized cropping and horizontal flipping. Tables 7, 8, and 9 also provide detailed fine-tuning hyperparameters, including fine-tuning epochs, warmup steps, learning rate, batch size, AdamW β, weight decay, drop path, image resolution, and inference beam size. (A schedule-and-augmentation sketch based on these values appears after this table.)
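
For reproduction, the released BLIP-2 checkpoints can be loaded through the LAVIS repository linked in the Open Source Code row. The sketch below follows the usage pattern documented in the LAVIS README; the `name`/`model_type` strings and the image path are illustrative and should be verified against the repository's model zoo.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # pip install salesforce-lavis

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model identifier strings are examples; check the LAVIS model zoo for the exact names.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # illustrative local image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot, instruction-following image-to-text generation.
print(model.generate({"image": image, "prompt": "Question: what is shown in the image? Answer:"}))
```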
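
The pre-training hyperparameters quoted in the Experiment Setup row can also be expressed as a short PyTorch sketch. This is not the authors' training code (which lives in LAVIS); the placeholder module and the torchvision transforms are assumptions used only to illustrate the stated AdamW settings, the 2k-step linear warmup with cosine decay to the second-stage minimum learning rate, and the 224×224 crop-and-flip augmentation.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

# Second-stage values quoted above: peak LR 1e-4, minimum LR 5e-5,
# 2k warmup steps, 80k total steps, AdamW betas (0.9, 0.98), weight decay 0.05.
peak_lr, min_lr, warmup_steps, total_steps = 1e-4, 5e-5, 2_000, 80_000

model = torch.nn.Linear(768, 768)  # hypothetical placeholder standing in for the trainable Q-Former
optimizer = AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.98), weight_decay=0.05)

def lr_lambda(step: int) -> float:
    """Linear warmup followed by cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr  # LambdaLR scales the base (peak) LR

scheduler = LambdaLR(optimizer, lr_lambda)  # call scheduler.step() after each optimizer step

# Augmentation described in the quote: 224x224 random resized crop and horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```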