Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and OwlEval benchmarks also demonstrate the superiority of VPG-C. |
| Researcher Affiliation | Academia | ¹Zhejiang University, ²National University of Singapore, ³Nanyang Technological University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://github.com/DCDmllm/Cheetah. |
| Open Datasets | Yes | Table 6 (Summary of the demonstrative instruction-following tasks in the DEMON benchmark) includes datasets such as ALFRED (Shridhar et al., 2020) and MMCoQA (Li et al., 2022e). |
| Dataset Splits | No | The paper mentions using 500k image-caption pairs from CC3M for training but does not provide specific percentages, counts, or methodology for train/validation/test dataset splits. |
| Hardware Specification | Yes | Without additional demonstrative instruction data, the lightweight VPG-C module can be effectively tuned by the synthetic training strategy in several hours with a single A100 GPU. |
| Software Dependencies | No | The paper mentions implementing VPG-C in the LAVIS library, but does not provide specific version numbers for software dependencies like Python, PyTorch, or other libraries used in the experiments. |
| Experiment Setup | Yes | We tune the VPG-C module for 18k steps using a batch size of 24 for synthetic training and 64 for image captioning. We adopt the AdamW optimizer with β = (0.9, 0.999), and set the learning rate and weight decay to 0.00002 and 0.05, respectively. We warm up the training with 2k warm-up steps, followed by a learning rate decay mechanism with the cosine schedule. |
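
For reference, the reported optimizer and schedule (AdamW, lr 2e-5, weight decay 0.05, β = (0.9, 0.999), 2k linear warm-up steps, cosine decay over 18k total steps) can be reconstructed roughly as below. This is a minimal sketch, not the authors' released code: `vpgc_module` and `warmup_cosine` are hypothetical placeholders, and the real VPG-C parameters and loss come from the LAVIS-based implementation in the linked repository.

```python
# Sketch of the optimizer/LR schedule described in the Experiment Setup row.
# Assumptions: `vpgc_module` stands in for the trainable VPG-C parameters;
# the loss computation is omitted because it depends on the synthetic
# discriminative training task, which is not reproduced here.
import math

import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 18_000   # reported number of tuning steps
WARMUP_STEPS = 2_000   # reported warm-up steps

vpgc_module = nn.Linear(768, 768)  # placeholder for the lightweight VPG-C module

optimizer = AdamW(
    vpgc_module.parameters(),
    lr=2e-5,                # reported learning rate (0.00002)
    betas=(0.9, 0.999),     # reported AdamW betas
    weight_decay=0.05,      # reported weight decay
)

def warmup_cosine(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Skeleton loop; in the paper, batches of size 24 (synthetic training)
# and 64 (image captioning) are used for the actual objective.
for step in range(TOTAL_STEPS):
    # loss = compute_loss(batch)   # omitted: task-specific
    # loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```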