Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and Owl Eval benchmarks also demonstrate the superiority of VPG-C. (Section 4: Experiments)
Researcher Affiliation | Academia | 1 Zhejiang University, 2 National University of Singapore, 3 Nanyang Technological University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available at https://github.com/DCDmllm/Cheetah.
Open Datasets | Yes | Table 6 ("Summary of the demonstrative instruction-following tasks in DEMON benchmark") includes datasets such as ALFRED (Shridhar et al., 2020) and MMCoQA (Li et al., 2022e).
Dataset Splits | No | The paper mentions using 500k image-caption pairs from CC3M for training but does not provide specific percentages, counts, or methodology for train/validation/test dataset splits.
Hardware Specification | Yes | Without additional demonstrative instruction data, the lightweight VPG-C module can be effectively tuned by the synthetic training strategy in several hours with a single A100 GPU.
Software Dependencies | No | The paper mentions implementing VPG-C in the LAVIS library, but does not provide specific version numbers for software dependencies like Python, PyTorch, or other libraries used in the experiments.
Experiment Setup | Yes | We tune the VPG-C module for 18k steps using a batch size of 24 for synthetic training and 64 for image captioning. We adopt the AdamW optimizer with β = (0.9, 0.999), and set the learning rate and weight decay to 0.00002 and 0.05, respectively. We warm up the training with 2k warm-up steps, followed by a learning rate decay mechanism with the cosine schedule.
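
The quoted setup describes the optimizer and schedule concretely enough to sketch in code. The following is a minimal PyTorch sketch of that configuration, not the authors' released code: the nn.Linear placeholder standing in for the VPG-C module, the per-step LambdaLR formulation of warm-up plus cosine decay, and the decay-to-zero floor are all assumptions (the paper's implementation builds on the LAVIS library).

    # Minimal sketch of the reported optimization setup: AdamW with betas (0.9, 0.999),
    # learning rate 2e-5, weight decay 0.05, 2k linear warm-up steps, cosine decay
    # over 18k total steps. Placeholder module and schedule details are assumptions.
    import math
    import torch
    from torch import nn

    TOTAL_STEPS = 18_000   # "We tune the VPG-C module for 18k steps"
    WARMUP_STEPS = 2_000   # "2k warm-up steps"

    # Placeholder standing in for the lightweight VPG-C module; the actual module
    # is defined in the authors' LAVIS-based code base.
    vpg_c = nn.Linear(768, 768)

    optimizer = torch.optim.AdamW(
        vpg_c.parameters(),
        lr=2e-5,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    )

    def lr_lambda(step: int) -> float:
        # Linear warm-up for the first 2k steps, then cosine decay; decaying to zero
        # is an assumption, since the quoted text does not state a minimum learning rate.
        if step < WARMUP_STEPS:
            return step / max(1, WARMUP_STEPS)
        progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

In such a setup, scheduler.step() would be called once per optimizer step, with batches of 24 examples for synthetic training and 64 for image captioning, as described in the quoted passage.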