Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning
Authors: Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, Cihang Xie
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces an efficient strategy to transform Large Language Models (LLMs) into Multi-Modal Large Language Models. By conceptualizing this transformation as a domain adaptation process, i.e., transitioning from text understanding to embracing multiple modalities, we intriguingly note that, within each attention block, tuning LayerNorm suffices to yield strong performance. Moreover, when benchmarked against other tuning approaches like full parameter finetuning or LoRA, its benefits on efficiency are substantial. For example, when compared to LoRA on a 13B model scale, performance can be enhanced by an average of over 20% across five multi-modal tasks, and meanwhile, trainable parameters are reduced by 41.9% and GPU memory usage by 17.6%. We conduct all of our experiments on 4 80G A100 GPUs on the same node. (A minimal parameter-freezing sketch of the LayerNorm-only tuning strategy is given after this table.) |
| Researcher Affiliation | Academia | 1 University of Edinburgh; 2 University of Chinese Academy of Sciences; 3 Johns Hopkins University; 4 UC Santa Cruz |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific repository links or explicit statements about the release of their source code. |
| Open Datasets | Yes | We pre-train the vision-language connector for 3 epochs on CC3M (Changpinyo et al., 2021), and conduct the finetuning stage on 80K filtered image-text pairs collected by Liu et al. (2023) for 1 epoch. We test the visual-instruction tuned models on recent multi-modal evaluation benchmarks, where five multi-modal benchmarks are deployed: MME (Fu et al., 2023) consists of two evaluation aspects, i.e., cognition (CS) and perception (PS), with a total of 14 VQA tasks; VQAv2 (Antol et al., 2015), MSCOCO (Lin et al., 2014) and Flickr30k (Young et al., 2014) captioning tasks are commonly used benchmarks in the field of VQA and captioning. The former two benchmarks are based on the MSCOCO-2017 dataset (Lin et al., 2014). For the latter two captioning tasks, we report the zero-shot CIDEr (Vedantam et al., 2015) scores (with three text-only QA examples) on the test set from Karpathy & Fei-Fei (2015). POPE (Li et al., 2023b) is used to evaluate the level of object hallucinations in MLLMs, which consists of three versions of balanced yes/no VQA tasks (i.e., Popular/Random/Adversarial) considering objects in the given image. |
| Dataset Splits | Yes | During the second stage, we search the learning rate from 2e-3 to 1e-7 with 11 options for all tuning strategies and pick the best learning rate based on their performance on the Flickr30k task. For the MSCOCO and Flickr30k captioning tasks, we report the zero-shot CIDEr (Vedantam et al., 2015) scores (with three text-only QA examples) on the test set from Karpathy & Fei-Fei (2015). |
| Hardware Specification | Yes | We conduct all of our experiments on 4 80G A100 GPUs on the same node. |
| Software Dependencies | No | The paper mentions the 'DeepSpeed technique (Rajbhandari et al., 2020), and a data precision of TensorFloat32' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the first stage, we set the learning rate to 2e-3 for all variants. During the second stage, we search the learning rate from 2e-3 to 1e-7 with 11 options for all tuning strategies and pick the best learning rate based on their performance on the Flickr30k task. We set the weight decay (Loshchilov & Hutter, 2019) to 0 and the warmup ratio to 0.03 with the cosine learning rate scheduler (Loshchilov & Hutter, 2017). Moreover, we employ gradient checkpointing (Chen et al., 2016), the DeepSpeed technique (Rajbhandari et al., 2020), and TensorFloat32 precision for model training. (Hedged sketches of the learning-rate grid and this training configuration are given after this table.) |
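
The Research Type row quotes the core idea: within each attention block, only the LayerNorm parameters are tuned. The following is a minimal sketch of that freezing strategy, not the authors' released code; the checkpoint name is a placeholder, and the module-name check (`input_layernorm` / `post_attention_layernorm`) is an assumption based on common LLaMA-style naming.

```python
# Minimal sketch of LayerNorm-only tuning (assumptions: HF-style LLaMA-family
# checkpoint and module names; not the authors' released code).
import torch.nn as nn
from transformers import AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-13b-hf"  # placeholder 13B backbone
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable gradients only for the normalization layers inside the decoder
# blocks. LLaMA-style models use RMSNorm modules named "input_layernorm" /
# "post_attention_layernorm", hence the name-based check alongside nn.LayerNorm.
for name, module in model.named_modules():
    if isinstance(module, nn.LayerNorm) or name.split(".")[-1].endswith("layernorm"):
        for param in module.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")
```

Leaving only the normalization weights trainable is what drives the reported efficiency gains over LoRA (41.9% fewer trainable parameters, 17.6% less GPU memory at the 13B scale).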
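The second-stage learning-rate search quoted above (2e-3 down to 1e-7, 11 options, best picked by Flickr30k performance) can be approximated with a log-spaced grid. The exact 11 values are not listed in the excerpt, so the spacing below is an assumption.

```python
# Hedged sketch of the 11-option learning-rate sweep from 2e-3 down to 1e-7.
# The excerpt does not list the exact candidates, so log spacing is assumed.
import numpy as np

lr_grid = np.logspace(np.log10(2e-3), np.log10(1e-7), num=11)
print([f"{lr:.2e}" for lr in lr_grid])
# Each candidate is trained with the same recipe; the best one is selected by
# zero-shot CIDEr on Flickr30k, per the quoted setup.
```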
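The finetuning-stage recipe in the Experiment Setup row maps naturally onto HuggingFace `TrainingArguments`. The sketch below is an assumption-laden rendering, not the authors' launch script: the output directory, the chosen learning rate, and the DeepSpeed config path are placeholders.

```python
# Hedged rendering of the quoted finetuning configuration as HuggingFace
# TrainingArguments. output_dir, learning_rate, and the DeepSpeed config path
# are placeholders; the remaining values mirror the quoted setup.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mllm-finetune",      # placeholder
    num_train_epochs=1,                # 1 epoch on the 80K filtered image-text pairs
    learning_rate=2e-5,                # placeholder; best value comes from the 11-option search
    weight_decay=0.0,                  # weight decay set to 0
    warmup_ratio=0.03,                 # warmup ratio 0.03
    lr_scheduler_type="cosine",        # cosine learning-rate scheduler
    gradient_checkpointing=True,       # gradient checkpointing enabled
    tf32=True,                         # TensorFloat32 precision on A100s
    deepspeed="ds_config_zero2.json",  # hypothetical DeepSpeed (ZeRO) config file
)
```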