Visual Instruction Tuning
Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We assess the performance of LLaVA in instruction-following and visual reasoning capabilities with two primary experimental settings: multimodal chatbot and the ScienceQA dataset, respectively. |
| Researcher Affiliation | Collaboration | Haotian Liu1, Chunyuan Li2, Qingyang Wu3, Yong Jae Lee1; 1University of Wisconsin–Madison, 2Microsoft Research, 3Columbia University |
| Pseudocode | No | The information is insufficient. The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We make GPT-4 generated visual instruction tuning data, our model, and code publicly available. We release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. Our source code, generated instruction-tuning data, and proposed benchmark are uploaded to the anonymized GitHub repository: LLaVA-Annonymous/LLaVA. 1. Source Code: link |
| Open Datasets | Yes | We use COCO images [30] and generate three types of instruction-following data. To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs. Please see Appendix for details of the filtering process. We study our method on the ScienceQA benchmark [33], the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations. |
| Dataset Splits | Yes | The benchmark dataset is split into training, validation, and test splits with 12726, 4241, and 4241 examples, respectively. |
| Hardware Specification | Yes | We train all models with 8 A100s |
| Software Dependencies | No | The information is insufficient. The paper mentions tools like Vicuna and CLIP, and training components like Adam optimizer, FSDP, BF16, and TF32, but does not provide specific version numbers for the software dependencies or programming languages used for implementation (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. We train all models with 8 A100s, following Vicuna's hyperparameters [9]. These quoted hyperparameters are summarized in the sketch below the table. |
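
For concreteness, the two-stage training schedule quoted in the Experiment Setup row can be collected into a minimal configuration sketch. Only the numeric values (epochs, learning rates, batch sizes, GPU count, and the FSDP/BF16/TF32 mentions) come from the paper's text; every name below (`PRETRAIN_CFG`, `FINETUNE_CFG`, `describe`, the stage labels) is an illustrative assumption, not the authors' actual launch script or config format.

```python
# Hypothetical summary of the LLaVA training schedule quoted above.
# Numeric values are taken from the paper's text; all identifiers here
# are illustrative assumptions rather than the released codebase's API.

PRETRAIN_CFG = {
    "dataset": "CC-595K",          # filtered CC3M subset of 595K image-text pairs
    "epochs": 1,
    "learning_rate": 2e-3,
    "batch_size": 128,
}

FINETUNE_CFG = {
    "dataset": "LLaVA-Instruct-158K",
    "epochs": 3,
    "learning_rate": 2e-5,
    "batch_size": 32,
}

HARDWARE = {
    "gpus": "8x A100",             # "We train all models with 8 A100s"
    "distributed": "FSDP",         # mentioned in the paper, no version given
    "precision": ["BF16", "TF32"], # precision flags named in the paper
}

def describe(stage: str, cfg: dict) -> str:
    """Render one training stage as a one-line human-readable summary."""
    return (f"{stage}: {cfg['epochs']} epoch(s) on {cfg['dataset']}, "
            f"lr={cfg['learning_rate']}, batch size={cfg['batch_size']}")

if __name__ == "__main__":
    print(describe("Stage 1 (pre-training)", PRETRAIN_CFG))
    print(describe("Stage 2 (instruction tuning)", FINETUNE_CFG))
```

The sketch only restates what the report quotes; software versions are deliberately omitted, since the Software Dependencies row notes that the paper does not specify them.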