UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Authors: Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 7 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, Video QA, VQA and caption) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even surpasses the full fine-tuning strategy. |
| Researcher Affiliation | Collaboration | Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; University of California, Berkeley, United States; Baichuan Inc. |
| Pseudocode | No | The paper describes its methodology using text and mathematical equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are available at https://github.com/RERV/UniAdapter. |
| Open Datasets | Yes | We evaluate our proposed UniAdapter on 7 downstream datasets, including video-text retrieval datasets: MSR-VTT (Xu et al., 2016) and DiDeMo (Hendricks et al., 2017); image-text retrieval datasets: MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015); video question answering dataset: MSRVTT-QA (Xu et al., 2017); visual question answering dataset: VQAv2 (Goyal et al., 2017); and Caption dataset: MSCOCO (Lin et al., 2014). |
| Dataset Splits | Yes | We follow recent works (Lei et al., 2021; Luo et al., 2021) to adopt the 1k-A split (with 9,000/1,000 videos) for training/testing. |
| Hardware Specification | Yes | All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs. |
| Software Dependencies | No | The paper mentions specific frameworks and initialization methods (e.g., BLIP-base, Kaiming Normal) but does not provide version numbers for ancillary software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We set the UniAdapter hyperparameters uniformly for all modalities as: input/output dimension d = 768, bottleneck dimension r = 128 (4.8M) or r = 512 (19.0M), and scaling factor s = 0.1. Following previous works, we initialize the weights of down-projection layers for UniAdapter with Kaiming Normal (He et al., 2015) and configure the weights of the up-projection layers with zero initialization. For video-text downstream tasks, we uniformly sample N = 8 frames per video during training, N = 16 frames per video during inference (but N = 8 for ablation study). All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs. More details are given in Appendix. The paper also includes 'Table 16: Parameter-efficient fine-tuning hyperparameters for each task.' A minimal adapter sketch based on these reported values is given below the table. |
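
The hyperparameters quoted in the Experiment Setup row fully determine the shape and initialization of a bottleneck adapter, so the following PyTorch sketch shows how they fit together. It is an illustration under assumptions, not the authors' released code: the class name `UniAdapterSketch`, the GELU activation, and the bias handling are our choices; only d = 768, r in {128, 512}, s = 0.1, Kaiming Normal down-projection, and zero-initialized up-projection come from the table above.

```python
import torch
import torch.nn as nn


class UniAdapterSketch(nn.Module):
    """Bottleneck adapter sketch: down-projection -> activation -> up-projection,
    scaled by s and added residually to the input (d=768, r=128 or 512, s=0.1)."""

    def __init__(self, d: int = 768, r: int = 128, s: float = 0.1):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.up = nn.Linear(r, d)
        self.act = nn.GELU()  # assumption: the activation is not restated in this table
        self.scale = s
        # Kaiming Normal for the down-projection, zeros for the up-projection,
        # so the adapter behaves as an identity mapping at the start of training.
        nn.init.kaiming_normal_(self.down.weight)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.up(self.act(self.down(x)))


if __name__ == "__main__":
    adapter = UniAdapterSketch(d=768, r=128, s=0.1)
    tokens = torch.randn(2, 197, 768)  # (batch, sequence length, hidden size)
    print(adapter(tokens).shape)       # torch.Size([2, 197, 768])
```

With r = 128 the adapter adds roughly 2 x 768 x 128 weights per insertion point, which is consistent with the 4.8M total parameter count quoted for that configuration once adapters are placed throughout the backbone.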