InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, Steven Hoi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). A simplified sketch of the instruction-aware Query Transformer idea is given below the table. |
| Researcher Affiliation | Collaboration | 1Salesforce Research 2Hong Kong University of Science and Technology 3Nanyang Technological University, Singapore |
| Pseudocode | No | The paper includes architectural diagrams and descriptions but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All InstructBLIP models are open-source. https://github.com/salesforce/LAVIS/tree/main/projects/instructblip (a minimal loading sketch via LAVIS is given below the table) |
| Open Datasets | Yes | To ensure the diversity of instruction tuning data while considering their accessibility, we gather a comprehensive set of publicly available vision-language datasets, and transform them into the instruction tuning format. As shown in Figure 2, the final collection covers 11 task categories and 26 datasets, including image captioning [10, 11, 12], image captioning with reading comprehension [13], visual reasoning [14, 15, 16], image question answering [17, 18], knowledge-grounded image question answering [19, 20, 21], image question answering with reading comprehension [22, 23], image question generation (adapted from the QA datasets), video question answering [24, 25], visual conversational question answering [26], image classification [27], and LLaVA-Instruct-150K [28]. |
| Dataset Splits | Yes | To ensure sufficient data and tasks for training and zero-shot evaluation, we divide the 26 datasets into 13 held-in datasets and 13 held-out datasets, indicated by yellow and white respectively in Figure 2. We employ the training sets of the held-in datasets for instruction tuning and their validation or test sets for held-in evaluation. |
| Hardware Specification | Yes | All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days. |
| Software Dependencies | No | The paper mentions using the 'LAVIS library [32]' and 'AdamW [33] optimizer' but does not specify their version numbers. |
| Experiment Setup | Yes | We employ a batch size of 192, 128, and 64 for the 3B, 7B, and 11/13B models, respectively. The AdamW [33] optimizer is used, with β1 = 0.9, β2 = 0.999, and a weight decay of 0.05. Additionally, we apply a linear warmup of the learning rate during the initial 1,000 steps, increasing from 10⁻⁸ to 10⁻⁵, followed by a cosine decay with a minimum learning rate of 0. For decoding, we adopt beam search with a beam size of 1 for Hateful Memes, VSR, and OCR-VQA, 3 for NoCaps, and 5 for the other tasks. A PyTorch sketch of this optimization schedule is given below the table. |
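
The instruction-aware Query Transformer referenced in the Research Type row can be illustrated with a highly simplified sketch: learned query embeddings are concatenated with embedded instruction tokens so that self-attention conditions the queries on the instruction, while cross-attention extracts features from the frozen image encoder. The dimensions, layer count, and use of `nn.TransformerDecoderLayer` below are illustrative assumptions, not the paper's actual BERT-based Q-Former implementation.

```python
# Simplified sketch of an instruction-aware Q-Former (illustrative only).
import torch
import torch.nn as nn

class InstructionAwareQFormerSketch(nn.Module):
    def __init__(self, hidden=768, num_queries=32, vocab=30522, layers=2):
        super().__init__()
        # Learned query embeddings (InstructBLIP/BLIP-2 use 32 queries).
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        self.embed = nn.Embedding(vocab, hidden)
        # A TransformerDecoder gives self-attention over (queries + instruction)
        # plus cross-attention over the frozen image encoder's patch features.
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=layers)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_ids):
        # image_feats: (B, num_patches, hidden) from a frozen vision encoder
        # instruction_ids: (B, L) tokenized instruction
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        txt = self.embed(instruction_ids)
        tgt = torch.cat([q, txt], dim=1)        # queries attend to the instruction
        out = self.blocks(tgt=tgt, memory=image_feats)
        return out[:, : self.num_queries]       # only query outputs are kept

model = InstructionAwareQFormerSketch()
img = torch.randn(2, 257, 768)                  # placeholder ViT patch features
ids = torch.randint(0, 30522, (2, 8))           # placeholder instruction tokens
print(model(img, ids).shape)                    # torch.Size([2, 32, 768])
```

Only the query outputs are passed on to the frozen LLM, which is the mechanism the paper credits for extracting visual features tailored to the given instruction.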
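For the Open Source Code row, the released checkpoints are distributed through the LAVIS library, and a minimal loading sketch is shown below. The model name and type strings ("blip2_vicuna_instruct" / "vicuna7b") follow LAVIS model-zoo naming but should be treated as assumptions to verify against the linked repository; the image path is a placeholder.

```python
# Minimal sketch: load an InstructBLIP checkpoint through LAVIS and run
# instruction-conditioned generation on a single image.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed model-zoo identifiers; other released variants (e.g. the FlanT5-based
# models) use different name/model_type strings.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")   # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# InstructBLIP takes a free-form instruction as the text prompt.
output = model.generate({"image": image, "prompt": "Describe the image in detail."})
print(output)
```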
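The optimization recipe in the Experiment Setup row (AdamW with β1 = 0.9, β2 = 0.999, weight decay 0.05, linear warmup from 10⁻⁸ to 10⁻⁵ over the first 1,000 steps, then cosine decay to a minimum of 0) can be sketched in PyTorch as below. The total step count and the tiny placeholder model are assumptions for illustration, not values from the paper.

```python
# Sketch of the AdamW + linear-warmup + cosine-decay schedule described above.
import math
import torch

model = torch.nn.Linear(10, 10)        # placeholder for the trainable parameters
peak_lr, init_lr, min_lr = 1e-5, 1e-8, 0.0
warmup_steps, total_steps = 1_000, 60_000   # total_steps is an assumed value

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.05
)

def lr_at(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup from init_lr to peak_lr over the first 1,000 steps.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr for the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# LambdaLR scales the base lr (peak_lr), so divide to recover lr_at(step).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at(step) / peak_lr
)

for step in range(total_steps):
    x = torch.randn(4, 10)
    loss = model(x).pow(2).mean()      # stand-in for the instruction-tuning loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```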