InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). (A minimal sketch of the instruction-aware Q-Former idea appears after the table.)
Researcher Affiliation | Collaboration | 1 Salesforce Research; 2 Hong Kong University of Science and Technology; 3 Nanyang Technological University, Singapore
Pseudocode | No | The paper includes architectural diagrams and descriptions but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | All InstructBLIP models are open-source. https://github.com/salesforce/LAVIS/tree/main/projects/instructblip (A hedged example of loading a released checkpoint through LAVIS appears after the table.)
Open Datasets | Yes | To ensure the diversity of instruction tuning data while considering their accessibility, we gather a comprehensive set of publicly available vision-language datasets, and transform them into the instruction tuning format. As shown in Figure 2, the final collection covers 11 task categories and 26 datasets, including image captioning [10, 11, 12], image captioning with reading comprehension [13], visual reasoning [14, 15, 16], image question answering [17, 18], knowledge-grounded image question answering [19, 20, 21], image question answering with reading comprehension [22, 23], image question generation (adapted from the QA datasets), video question answering [24, 25], visual conversational question answering [26], image classification [27], and LLaVA-Instruct-150K [28]. (A sketch of the instruction-format conversion appears after the table.)
Dataset Splits | Yes | To ensure sufficient data and tasks for training and zero-shot evaluation, we divide the 26 datasets into 13 held-in datasets and 13 held-out datasets, indicated by yellow and white respectively in Figure 2. We employ the training sets of the held-in datasets for instruction tuning and their validation or test sets for held-in evaluation.
Hardware Specification | Yes | All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days.
Software Dependencies | No | The paper mentions using the 'LAVIS library [32]' and 'AdamW [33] optimizer' but does not specify their version numbers.
Experiment Setup | Yes | We employ a batch size of 192, 128, and 64 for the 3B, 7B, and 11/13B models, respectively. The AdamW [33] optimizer is used, with β1 = 0.9, β2 = 0.999, and a weight decay of 0.05. Additionally, we apply a linear warmup of the learning rate during the initial 1,000 steps, increasing from 10^-8 to 10^-5, followed by a cosine decay with a minimum learning rate of 0. For decoding, we adopt beam search with a beam size of 1 for Hateful Memes, VSR, and OCR-VQA, 3 for NoCaps, and 5 for the other tasks. (A sketch of this optimization schedule appears after the table.)
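
The instruction-aware Query Transformer noted in the Research Type row feeds the instruction tokens into the Q-Former together with the learnable query embeddings, so the queries extract instruction-relevant features from the frozen image encoder. The block below is a minimal, self-contained sketch of that idea; the single transformer block, layer sizes, and class name are illustrative assumptions, not the paper's actual BERT-based Q-Former implementation.

```python
# Minimal sketch of an instruction-aware Q-Former block: learnable query tokens
# and instruction tokens interact via self-attention, and the queries then
# cross-attend to frozen image features. All dimensions are illustrative.
import torch
import torch.nn as nn


class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, dim=768, n_heads=12, n_queries=32, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.proj_to_llm = nn.Linear(dim, llm_dim)  # maps query outputs into the LLM embedding space

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, dim) from the frozen image encoder
        # instruction_embeds: (B, L, dim) embeddings of the tokenized instruction
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)
        # Self-attention over [queries; instruction] lets the instruction steer the queries.
        x = torch.cat([q, instruction_embeds], dim=1)
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        q = x[:, : q.size(1)]
        # Queries cross-attend to image features to extract instruction-relevant visual features.
        q = q + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        q = q + self.ffn(q)
        return self.proj_to_llm(q)  # (B, n_queries, llm_dim), prepended to the frozen LLM's input
```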
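
Since the models are released through the LAVIS repository (Open Source Code row), a checkpoint can in principle be loaded with LAVIS's load_model_and_preprocess helper. The model name, model_type string, and generate keyword below are assumptions based on the linked project page and may differ across LAVIS versions; check the repository README for the identifiers matching your install.

```python
# Hedged example of loading an open-sourced InstructBLIP checkpoint through LAVIS.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# "blip2_vicuna_instruct" / "vicuna7b" are assumed identifiers from the project page.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# num_beams=5 mirrors the default beam size reported for most evaluation tasks.
output = model.generate({"image": image, "prompt": "What is unusual about this image?"}, num_beams=5)
print(output)
```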
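
The Open Datasets row states that existing datasets are transformed into an instruction-tuning format; the paper does this by crafting natural-language instruction templates per task and filling them with dataset samples. The snippet below is a hypothetical illustration of such a conversion for a VQA-style sample; the template wording and field names are invented for the example and are not the paper's actual templates.

```python
# Hypothetical sketch of converting a VQA sample into instruction-tuning format.
import random

VQA_TEMPLATES = [
    "<Image> Question: {question} Short answer:",
    "<Image> Given the image, answer the following question briefly. {question}",
    "<Image> {question} Answer the question using a single word or phrase.",
]


def to_instruction_format(sample):
    """sample: dict with keys 'image', 'question', 'answer' (hypothetical schema)."""
    template = random.choice(VQA_TEMPLATES)
    return {
        "image": sample["image"],
        "text_input": template.format(question=sample["question"]),
        "text_output": sample["answer"],
    }
```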
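
The Experiment Setup row fully specifies the optimizer and learning-rate schedule, so it can be written down directly. The sketch below assumes PyTorch; the total step count and the trainable parameter list are placeholders, since only the warmup length, initial/peak learning rates, cosine decay target, betas, and weight decay are reported.

```python
# Sketch of the reported optimization setup, assuming PyTorch.
import math
import torch

warmup_steps = 1_000
total_steps = 60_000          # assumed placeholder; not reported in this section
init_lr, peak_lr, min_lr = 1e-8, 1e-5, 0.0

trainable = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the Q-Former / projection params
optimizer = torch.optim.AdamW(trainable, lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.05)


def lr_at(step: int) -> float:
    """Linear warmup from init_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# In a training loop the schedule would be applied per step before optimizer.step();
# here we only print the learning rate at a few representative steps.
for step in (0, 500, 1_000, 30_000, 60_000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    print(f"step {step:>6}: lr = {optimizer.param_groups[0]['lr']:.2e}")
```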