Visual Instruction Tuning

Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset (see the relative-score sketch after this table). When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We assess the performance of LLaVA in instruction-following and visual reasoning capabilities with two primary experimental settings: multimodal chatbot and the ScienceQA dataset, respectively.
Researcher Affiliation | Collaboration | Haotian Liu¹, Chunyuan Li², Qingyang Wu³, Yong Jae Lee¹ — ¹University of Wisconsin-Madison, ²Microsoft Research, ³Columbia University
Pseudocode | No | The information is insufficient. The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We make GPT-4 generated visual instruction tuning data, our model, and code publicly available. We release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. Our source code, generated instruction-tuning data, and proposed benchmark are uploaded to the anonymized GitHub repository: LLaVA-Annonymous/LLaVA. 1. Source Code: link
Open Datasets | Yes | We use COCO images [30] and generate three types of instruction-following data. To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs. Please see Appendix for details of the filtering process. We study our method on the ScienceQA benchmark [33], the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations.
Dataset Splits | Yes | The benchmark dataset is split into training, validation, and test splits with 12,726, 4,241, and 4,241 examples, respectively.
Hardware Specification | Yes | We train all models with 8 A100s.
Software Dependencies | No | The information is insufficient. The paper mentions tools like Vicuna and CLIP, and training components like the Adam optimizer, FSDP, BF16, and TF32, but does not provide specific version numbers for the software dependencies or programming languages used for implementation (e.g., Python, PyTorch/TensorFlow versions). A generic sketch of how these training components are typically enabled follows the table.
Experiment Setup | Yes | We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. We train all models with 8 A100s, following Vicuna's hyperparameters [9]. A hyperparameter sketch follows the table.
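
Relative-score sketch (referenced from the Research Type row). The paper reports an 85.1% relative score compared with GPT-4 on its synthetic instruction-following benchmark. The exact judging protocol is described in the paper; the 1-10 judge scale and the ratio-of-totals formula below are assumptions used only to illustrate what a "relative score" of this kind can look like, not the authors' exact implementation.

```python
# Hedged sketch of a "relative score compared with GPT-4".
# Assumption: a judge assigns each answer a score on a 1-10 scale, and the
# relative score is the ratio of the candidate's total to the reference's total.

def relative_score(candidate_scores, reference_scores):
    """Return the candidate/reference score ratio as a percentage."""
    assert len(candidate_scores) == len(reference_scores)
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Hypothetical judge ratings for three questions (illustrative values only).
llava_scores = [8, 7, 9]
gpt4_scores = [9, 9, 10]
print(f"Relative score: {relative_score(llava_scores, gpt4_scores):.1f}%")
```

Precision and sharding sketch (referenced from the Software Dependencies row). The paper names the Adam optimizer, FSDP, BF16, and TF32 without versions. The snippet below is a generic PyTorch illustration of how such components are commonly enabled; none of the flag values or function names beyond the standard PyTorch APIs are taken from the paper, and running the FSDP wrapper requires an initialized distributed process group.

```python
# Generic PyTorch sketch of the training components named in the paper
# (Adam optimizer, FSDP, BF16, TF32). Illustrative only; not the authors' code.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Allow TF32 matmuls/convolutions on Ampere GPUs such as the A100.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

def wrap_model(model: nn.Module) -> FSDP:
    # Shard parameters across ranks; assumes init_process_group() was called.
    return FSDP(model)

def make_optimizer(model: nn.Module, lr: float) -> torch.optim.Optimizer:
    # The paper mentions Adam; specific betas/weight decay are not reported.
    return torch.optim.Adam(model.parameters(), lr=lr)

def forward_bf16(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Mixed-precision forward pass in BF16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(batch)
```

Hyperparameter sketch (referenced from the Experiment Setup row). This is a minimal summary of the two-stage recipe quoted above; only the datasets, epoch counts, learning rates, and global batch sizes come from the paper, while the variable names and the per-GPU split across the 8 A100s are assumptions for illustration.

```python
# Minimal sketch of the two-stage LLaVA training recipe described above.
# Numbers are taken from the quoted Experiment Setup row; names and the
# per-GPU batch split are assumptions, not values reported in the paper.

STAGE_1_PRETRAIN = {           # feature alignment on the filtered CC-595K subset
    "dataset": "CC-595K",
    "epochs": 1,
    "learning_rate": 2e-3,
    "global_batch_size": 128,  # e.g. 16 per GPU x 8 A100s (split is an assumption)
}

STAGE_2_FINETUNE = {           # instruction tuning on LLaVA-Instruct-158K
    "dataset": "LLaVA-Instruct-158K",
    "epochs": 3,
    "learning_rate": 2e-5,
    "global_batch_size": 32,   # e.g. 4 per GPU x 8 A100s (split is an assumption)
}

def run_stage(config: dict) -> None:
    """Placeholder for a training loop; the schedule otherwise follows Vicuna's hyperparameters [9]."""
    print(f"Training on {config['dataset']} for {config['epochs']} epoch(s) "
          f"at lr={config['learning_rate']} with global batch size {config['global_batch_size']}")

if __name__ == "__main__":
    run_stage(STAGE_1_PRETRAIN)
    run_stage(STAGE_2_FINETUNE)
```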
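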
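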