Visual Instruction Tuning

Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset (see the relative-score sketch after this table). When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We assess the performance of LLaVA in instruction-following and visual reasoning capabilities with two primary experimental settings: multimodal chatbot and the ScienceQA dataset, respectively.
Researcher Affiliation | Collaboration | Haotian Liu¹, Chunyuan Li², Qingyang Wu³, Yong Jae Lee¹ — ¹University of Wisconsin-Madison, ²Microsoft Research, ³Columbia University
Pseudocode | No | The information is insufficient. The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We make GPT-4 generated visual instruction tuning data, our model, and code publicly available. We release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. Our source code, generated instruction-tuning data, and proposed benchmark are uploaded to the anonymized GitHub repository: LLaVA-Annonymous/LLaVA. 1. Source Code: link
Open Datasets | Yes | We use COCO images [30] and generate three types of instruction-following data. To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs. Please see Appendix for details of the filtering process. We study our method on the ScienceQA benchmark [33], the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations.
Dataset Splits | Yes | The benchmark dataset is split into training, validation, and test splits with 12,726, 4,241, and 4,241 examples, respectively.
Hardware Specification | Yes | We train all models with 8 A100s.
Software Dependencies | No | The information is insufficient. The paper mentions tools like Vicuna and CLIP, and training components like the Adam optimizer, FSDP, BF16, and TF32, but does not provide specific version numbers for the software dependencies or programming languages used for implementation (e.g., Python, PyTorch/TensorFlow versions). A generic sketch of how these training components are typically enabled follows the table.
Experiment Setup | Yes | We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. We train all models with 8 A100s, following Vicuna's hyperparameters [9]. A hyperparameter sketch follows the table.
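
Relative-score sketch (referenced from the Research Type row). The paper reports an 85.1% relative score compared with GPT-4 on its synthetic instruction-following benchmark. The exact judging protocol is described in the paper; the 1-10 judge scale and the ratio-of-totals formula below are assumptions used only to illustrate what a "relative score" of this kind can look like, not the authors' exact implementation.

```python
# Hedged sketch of a "relative score compared with GPT-4".
# Assumption: a judge assigns each answer a score on a 1-10 scale, and the
# relative score is the ratio of the candidate's total to the reference's total.

def relative_score(candidate_scores, reference_scores):
    """Return the candidate/reference score ratio as a percentage."""
    assert len(candidate_scores) == len(reference_scores)
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Hypothetical judge ratings for three questions (illustrative values only).
llava_scores = [8, 7, 9]
gpt4_scores = [9, 9, 10]
print(f"Relative score: {relative_score(llava_scores, gpt4_scores):.1f}%")
```

Precision and sharding sketch (referenced from the Software Dependencies row). The paper names the Adam optimizer, FSDP, BF16, and TF32 without versions. The snippet below is a generic PyTorch illustration of how such components are commonly enabled; none of the flag values or function names beyond the standard PyTorch APIs are taken from the paper, and running the FSDP wrapper requires an initialized distributed process group.

```python
# Generic PyTorch sketch of the training components named in the paper
# (Adam optimizer, FSDP, BF16, TF32). Illustrative only; not the authors' code.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Allow TF32 matmuls/convolutions on Ampere GPUs such as the A100.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

def wrap_model(model: nn.Module) -> FSDP:
    # Shard parameters across ranks; assumes init_process_group() was called.
    return FSDP(model)

def make_optimizer(model: nn.Module, lr: float) -> torch.optim.Optimizer:
    # The paper mentions Adam; specific betas/weight decay are not reported.
    return torch.optim.Adam(model.parameters(), lr=lr)

def forward_bf16(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # Mixed-precision forward pass in BF16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(batch)
```

Hyperparameter sketch (referenced from the Experiment Setup row). This is a minimal summary of the two-stage recipe quoted above; only the datasets, epoch counts, learning rates, and global batch sizes come from the paper, while the variable names and the per-GPU split across the 8 A100s are assumptions for illustration.

```python
# Minimal sketch of the two-stage LLaVA training recipe described above.
# Numbers are taken from the quoted Experiment Setup row; names and the
# per-GPU batch split are assumptions, not values reported in the paper.

STAGE_1_PRETRAIN = {           # feature alignment on the filtered CC-595K subset
    "dataset": "CC-595K",
    "epochs": 1,
    "learning_rate": 2e-3,
    "global_batch_size": 128,  # e.g. 16 per GPU x 8 A100s (split is an assumption)
}

STAGE_2_FINETUNE = {           # instruction tuning on LLaVA-Instruct-158K
    "dataset": "LLaVA-Instruct-158K",
    "epochs": 3,
    "learning_rate": 2e-5,
    "global_batch_size": 32,   # e.g. 4 per GPU x 8 A100s (split is an assumption)
}

def run_stage(config: dict) -> None:
    """Placeholder for a training loop; the schedule otherwise follows Vicuna's hyperparameters [9]."""
    print(f"Training on {config['dataset']} for {config['epochs']} epoch(s) "
          f"at lr={config['learning_rate']} with global batch size {config['global_batch_size']}")

if __name__ == "__main__":
    run_stage(STAGE_1_PRETRAIN)
    run_stage(STAGE_2_FINETUNE)
```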
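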
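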