Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and Owl Eval benchmarks also demonstrate the superiority of VPG-C. (Section 4: Experiments)
Researcher Affiliation | Academia | 1 Zhejiang University, 2 National University of Singapore, 3 Nanyang Technological University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available at https://github.com/DCDmllm/Cheetah.
Open Datasets | Yes | Table 6 ("Summary of the demonstrative instruction-following tasks in DEMON benchmark") includes datasets such as ALFRED (Shridhar et al., 2020) and MMCoQA (Li et al., 2022e).
Dataset Splits | No | The paper mentions using 500k image-caption pairs from CC3M for training but does not provide specific percentages, counts, or methodology for train/validation/test dataset splits.
Hardware Specification | Yes | Without additional demonstrative instruction data, the lightweight VPG-C module can be effectively tuned by the synthetic training strategy in several hours with a single A100 GPU.
Software Dependencies | No | The paper mentions implementing VPG-C in the LAVIS library, but does not provide specific version numbers for software dependencies like Python, PyTorch, or other libraries used in the experiments.
Experiment Setup | Yes | We tune the VPG-C module for 18k steps using a batch size of 24 for synthetic training and 64 for image captioning. We adopt the AdamW optimizer with β = (0.9, 0.999), and set the learning rate and weight decay to 0.00002 and 0.05, respectively. We warm up the training with 2k warm-up steps, followed by a learning rate decay mechanism with the cosine schedule.
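
The quoted setup describes the optimizer and schedule concretely enough to sketch in code. The following is a minimal PyTorch sketch of that configuration, not the authors' released code: the nn.Linear placeholder standing in for the VPG-C module, the per-step LambdaLR formulation of warm-up plus cosine decay, and the decay-to-zero floor are all assumptions (the paper's implementation builds on the LAVIS library).

    # Minimal sketch of the reported optimization setup: AdamW with betas (0.9, 0.999),
    # learning rate 2e-5, weight decay 0.05, 2k linear warm-up steps, cosine decay
    # over 18k total steps. Placeholder module and schedule details are assumptions.
    import math
    import torch
    from torch import nn

    TOTAL_STEPS = 18_000   # "We tune the VPG-C module for 18k steps"
    WARMUP_STEPS = 2_000   # "2k warm-up steps"

    # Placeholder standing in for the lightweight VPG-C module; the actual module
    # is defined in the authors' LAVIS-based code base.
    vpg_c = nn.Linear(768, 768)

    optimizer = torch.optim.AdamW(
        vpg_c.parameters(),
        lr=2e-5,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    )

    def lr_lambda(step: int) -> float:
        # Linear warm-up for the first 2k steps, then cosine decay; decaying to zero
        # is an assumption, since the quoted text does not state a minimum learning rate.
        if step < WARMUP_STEPS:
            return step / max(1, WARMUP_STEPS)
        progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

In such a setup, scheduler.step() would be called once per optimizer step, with batches of 24 examples for synthetic training and 64 for image captioning, as described in the quoted passage.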