Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Authors: Didi Zhu, Zhongyisun Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Chao Wu, Kun Kuang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
Researcher Affiliation | Collaboration | 1. Department of Computer Science, Zhejiang University; 2. Tencent Youtu Lab.
Pseudocode | No | The paper describes its method in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for Model Tailor, nor does it include a link to a code repository.
Open Datasets | Yes | For InstructBLIP, following (Dai et al., 2023), we engage with datasets including COCO Caption (Lin et al., 2014), NoCaps (in, near, out) (Agrawal et al., 2019), OKVQA (Marino et al., 2019), AOKVQA (Schwenk et al., 2022), GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), and Flickr30k (Young et al., 2014).
Dataset Splits | No | The paper lists datasets used for fine-tuning and evaluation but does not specify the train, validation, and test dataset splits or how these splits were determined for reproducibility.
Hardware Specification | Yes | All experiments were conducted on 8 NVIDIA V100 GPUs with 32GB of memory.
Software Dependencies | No | The paper mentions following the 'official codebase' guidelines for InstructBLIP and LLaVA, and references 'SparseGPT', but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Standard Fine-tuning. In the case of InstructBLIP, our fine-tuning procedure adhered to the official codebase guidelines with a batch size of 12 for each task, a maximum of 5 epochs, and a learning rate of 1e-5. Similarly, for LLaVA, we followed the official fine-tuning protocols, setting the maximum epoch to 1. The initial learning rate was configured at 2e-4 for fine-tuning on Flickr30k and 1e-4 for GQA, with a batch size of 64.
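
The reported fine-tuning hyperparameters can be collected into a small configuration sketch for reference. This is a minimal illustration assuming a plain Python dictionary; the key names and the `get_finetune_config` helper are hypothetical and are not taken from the official InstructBLIP or LLaVA codebases.

```python
from typing import Optional

# Hypothetical summary of the fine-tuning settings quoted in the table above.
# Key names and this helper are illustrative only; they do not come from
# the official InstructBLIP or LLaVA codebases.
FINETUNE_CONFIGS = {
    "instructblip": {  # same settings reported for each fine-tuning task
        "batch_size": 12,
        "max_epochs": 5,
        "learning_rate": 1e-5,
    },
    "llava-1.5": {
        "flickr30k": {"batch_size": 64, "max_epochs": 1, "learning_rate": 2e-4},
        "gqa": {"batch_size": 64, "max_epochs": 1, "learning_rate": 1e-4},
    },
}


def get_finetune_config(model: str, task: Optional[str] = None) -> dict:
    """Return the reported hyperparameters for a model (per task for LLaVA)."""
    config = FINETUNE_CONFIGS[model]
    if task is not None and task in config:
        return config[task]
    return config


if __name__ == "__main__":
    print(get_finetune_config("instructblip"))      # batch 12, 5 epochs, lr 1e-5
    print(get_finetune_config("llava-1.5", "gqa"))  # batch 64, 1 epoch, lr 1e-4
```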