UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Authors: Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 7 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, video QA, VQA, and captioning) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even surpasses the full fine-tuning strategy.
Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. University of California, Berkeley, United States; 3. Baichuan Inc.
Pseudocode | No | The paper describes its methodology using text and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available at https://github.com/RERV/UniAdapter.
Open Datasets | Yes | We evaluate our proposed UniAdapter on 7 downstream datasets, including video-text retrieval datasets: MSR-VTT (Xu et al., 2016) and DiDeMo (Hendricks et al., 2017); image-text retrieval datasets: MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015); video question answering dataset: MSRVTT-QA (Xu et al., 2017); visual question answering dataset: VQAv2 (Goyal et al., 2017); and captioning dataset: MSCOCO (Lin et al., 2014).
Dataset Splits | Yes | We follow recent works (Lei et al., 2021; Luo et al., 2021) to adopt the 1k-A split (with 9,000/1,000 videos) for training/testing.
Hardware Specification | Yes | All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs.
Software Dependencies | No | The paper mentions specific frameworks and initialization methods (e.g., BLIP-base, Kaiming Normal) but does not provide version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We set the UniAdapter hyperparameters uniformly for all modalities as: input/output dimension d = 768, bottleneck dimension r = 128 (4.8M) or r = 512 (19.0M), and scaling factor s = 0.1. Following previous works, we initialize the weights of down-projection layers for UniAdapter with Kaiming Normal (He et al., 2015) and configure the weights of the up-projection layers with zero initialization. For video-text downstream tasks, we uniformly sample N = 8 frames per video during training and N = 16 frames per video during inference (but N = 8 for the ablation study). All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs. More details are given in the Appendix. The paper also includes "Table 16: Parameter-efficient fine-tuning hyperparameters for each task." (An illustrative adapter sketch based on these hyperparameters is given below.)
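To make the quoted setup concrete, the sketch below shows a generic bottleneck adapter module using the reported hyperparameters (d = 768, r = 128 or 512, s = 0.1) and the described initialization (Kaiming Normal for the down-projection, zeros for the up-projection). This is a minimal PyTorch illustration, not the official UniAdapter implementation: the choice of ReLU activation, the residual placement, and the class/variable names are assumptions, and UniAdapter's cross-modal weight sharing is not reproduced here.

```python
# Minimal bottleneck-adapter sketch (illustrative only, not the released UniAdapter code).
# Assumed: ReLU activation and a residual connection around the adapter.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 128, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection to the bottleneck dim r
        self.act = nn.ReLU()                     # activation choice is an assumption
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to the model dim d
        self.scale = scale                       # scaling factor s applied to the adapter output

        # Initialization as described in the experiment setup:
        # Kaiming Normal for the down-projection, zeros for the up-projection,
        # so the adapter initially contributes nothing and the frozen backbone is preserved.
        nn.init.kaiming_normal_(self.down.weight)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen-backbone features plus a scaled bottleneck transformation.
        return x + self.scale * self.up(self.act(self.down(x)))


if __name__ == "__main__":
    adapter = BottleneckAdapter(dim=768, bottleneck=128, scale=0.1)
    tokens = torch.randn(2, 197, 768)            # e.g., a batch of ViT token sequences
    out = adapter(tokens)
    print(out.shape)                             # torch.Size([2, 197, 768])
```

With r = 128 the module has roughly 2 * 768 * 128 trainable weights plus biases per adapter, which is consistent with the parameter-efficient regime described in the setup (only adapters are tuned while the backbone stays frozen).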