Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Authors: Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN over existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. |
| Researcher Affiliation | Academia | Gen Luo¹³, Yiyi Zhou¹², Tianhe Ren¹, Shengxin Chen¹, Xiaoshuai Sun¹², Rongrong Ji¹²³. ¹Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, 361005, P.R. China. ²Institute of Artificial Intelligence, Xiamen University, 361005, P.R. China. ³Peng Cheng Laboratory, Shenzhen, 518000, China. {luogen,chenshengxin,rentianhe}@stu.xmu.edu.cn, {zhouyiyi,xssun,rrji}@xmu.edu.cn |
| Pseudocode | No | The paper describes the proposed methods using text, equations, and diagrams, but does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our project is released at https://luogen1996.github.io/lavin. |
| Open Datasets | Yes | ScienceQA [24] is a large-scale multimodal dataset for science question answering... Alpaca-52k [42] contains 52k text-only instruction-following examples generated by GPT-3.5 [3]. LLaVA-158k [21] is a large-scale text-image instruction-following dataset... Results on COCO Captioning. In Tab. 4, we compare LaVIN with existing methods on the task of image captioning. |
| Dataset Splits | Yes | ScienceQA consists of text-only and text-image examples in three splits, namely train, val and test, with 12,726, 4,241 and 4,241 examples, respectively. |
| Hardware Specification | Yes | Notably, fine-tuning LaVIN on ScienceQA only takes 1.4 hours with 8 A100 GPUs... our tuning only takes 4 GPU hours on 8 A100s... All results are evaluated on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using "AdamW [23] as the optimizer" but does not specify version numbers for programming languages, libraries, or other software components used in the experiments. |
| Experiment Setup | Yes | The default dimension of the visual neck is set to 128. The dimension of MM-Adapter is 8, and the temperature is set to 10 for LaVIN-7B and 5 for LaVIN-13B... We adopt AdamW [23] as the optimizer, and train the model for 20 epochs with a cosine decay learning rate schedule. The batch size, learning rate and weight decay are set to 32, 9e-3 and 0.02, respectively. During the generation stage, decoding uses top-p sampling with a temperature of 0.1 and a top-p threshold of 0.75. For the multimodal chatbot experiments, all hyperparameters remain the same, except the training epochs, which are reduced to 15. |
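
For concreteness, the optimizer, schedule, and decoding settings reported in the Experiment Setup row can be expressed as a minimal PyTorch sketch. This is not the authors' released code: the `model` placeholder and the `top_p_sample` helper are hypothetical illustrations, and only the numeric settings (AdamW, 20 epochs, cosine decay, batch size 32, learning rate 9e-3, weight decay 0.02, decoding temperature 0.1, top-p 0.75) come from the paper.

```python
import torch

# Hypothetical stand-in for the trainable LaVIN parameters; the real model
# architecture is not reproduced here.
model = torch.nn.Linear(128, 128)

EPOCHS = 20          # reduced to 15 for the multimodal-chatbot experiments
BATCH_SIZE = 32
BASE_LR = 9e-3
WEIGHT_DECAY = 0.02

# AdamW with a cosine-decay learning-rate schedule over the full run,
# matching the hyperparameters quoted from the paper.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)


def top_p_sample(logits: torch.Tensor,
                 temperature: float = 0.1,
                 top_p: float = 0.75) -> int:
    """Nucleus (top-p) sampling with the decoding settings reported above."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability
    # (before adding the current token) does not exceed top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    next_token = sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]
    return int(next_token)


if __name__ == "__main__":
    logits = torch.randn(32000)  # vocabulary-sized logits for one decoding step
    print(top_p_sample(logits))
```

Note that the temperature of 10 (LaVIN-7B) or 5 (LaVIN-13B) quoted in the setup row belongs to the MM-Adapter routing, not to decoding; the decoding temperature is the 0.1 used in the sketch.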