Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As for evaluation, we build DEMON, a comprehensive benchmark for demonstrative instruction understanding. Synthetically trained with the proposed strategy, VPG-C achieves significantly stronger zero-shot performance across all tasks of DEMON. Further evaluation on the MME and Owl Eval benchmarks also demonstrate the superiority of VPG-C. (Section 4: Experiments) |
| Researcher Affiliation | Academia | 1Zhejiang University, 2National University of Singapore, 3Nanyang Technological University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://github.com/DCDmllm/Cheetah. |
| Open Datasets | Yes | Table 6 (Summary of the demonstrative instruction-following tasks in DEMON benchmark) includes datasets like ALFRED (Shridhar et al., 2020) and MMCoQA (Li et al., 2022e). |
| Dataset Splits | No | The paper mentions using 500k image-caption pairs from CC3M for training but does not provide specific percentages, counts, or methodology for train/validation/test dataset splits. |
| Hardware Specification | Yes | Without additional demonstrative instruction data, the lightweight VPG-C module can be effectively tuned by the synthetic training strategy in several hours with a single A100 GPU. |
| Software Dependencies | No | The paper mentions implementing VPG-C in the LAVIS library, but does not provide specific version numbers for software dependencies like Python, PyTorch, or other libraries used in the experiments. |
| Experiment Setup | Yes | We tune the VPG-C module for 18k steps using a batch size of 24 for synthetic training and 64 for image captioning. We adopt the AdamW optimizer with β = (0.9, 0.999), and set the learning rate and weight decay to 0.00002 and 0.05, respectively. We warm up the training with 2k warm-up steps, followed by a learning rate decay mechanism with the cosine schedule. |
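
The schedule quoted in the Experiment Setup row can be sketched in plain Python to make the numbers concrete. This is a minimal illustration, not the authors' implementation: the step counts, peak learning rate, weight decay, and betas come from the row above, while the linear warm-up shape, the starting value of 0, and the decay floor of 0 (`MIN_LR`) are assumptions the source does not state. The function name `lr_at_step` is hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
TOTAL_STEPS = 18_000
WARMUP_STEPS = 2_000
PEAK_LR = 2e-5
WEIGHT_DECAY = 0.05      # listed for completeness; not used by the schedule
BETAS = (0.9, 0.999)     # listed for completeness; not used by the schedule

# Assumption (not stated in the source): the cosine decay ends at 0.
MIN_LR = 0.0


def lr_at_step(step: int) -> float:
    """Return the learning rate at a 0-indexed optimizer step.

    Assumes linear warm-up from 0 to PEAK_LR over WARMUP_STEPS,
    then cosine decay from PEAK_LR to MIN_LR by TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for s in (0, 1_000, 2_000, 10_000, 18_000):
        print(s, lr_at_step(s))
```

Under these assumptions the rate peaks at step 2,000 and is halved at step 10,000, the midpoint of the decay phase. A different warm-up shape or a nonzero floor would change the curve but not the quoted hyperparameters.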