UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Authors: Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 7 cross-modal downstream benchmarks (including video-text retrieval, image-text retrieval, video QA, VQA, and captioning) show that in most cases, UniAdapter not only outperforms the state-of-the-arts, but even surpasses the full fine-tuning strategy.
Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. University of California, Berkeley, United States; 3. Baichuan Inc.
Pseudocode | No | The paper describes its methodology using text and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available at https://github.com/RERV/UniAdapter.
Open Datasets | Yes | We evaluate our proposed UniAdapter on 7 downstream datasets, including video-text retrieval datasets: MSR-VTT (Xu et al., 2016) and DiDeMo (Hendricks et al., 2017); image-text retrieval datasets: MSCOCO (Lin et al., 2014) and Flickr30K (Plummer et al., 2015); video question answering dataset: MSRVTT-QA (Xu et al., 2017); visual question answering dataset: VQAv2 (Goyal et al., 2017); and captioning dataset: MSCOCO (Lin et al., 2014).
Dataset Splits | Yes | We follow recent works (Lei et al., 2021; Luo et al., 2021) to adopt the 1k-A split (with 9,000/1,000 videos) for training/testing.
Hardware Specification | Yes | All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs.
Software Dependencies | No | The paper mentions specific frameworks and initialization methods (e.g., BLIP-base, Kaiming Normal) but does not provide version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We set the UniAdapter hyperparameters uniformly for all modalities as: input/output dimension d = 768, bottleneck dimension r = 128 (4.8M) or r = 512 (19.0M), and scaling factor s = 0.1. Following previous works, we initialize the weights of down-projection layers for UniAdapter with Kaiming Normal (He et al., 2015) and configure the weights of the up-projection layers with zero initialization. For video-text downstream tasks, we uniformly sample N = 8 frames per video during training and N = 16 frames per video during inference (but N = 8 for the ablation study). All experiments are conducted on 8x NVIDIA 3090Ti (24G) GPUs. More details are given in the Appendix. The paper also includes "Table 16: Parameter-efficient fine-tuning hyperparameters for each task." (An illustrative adapter sketch based on these hyperparameters is given below.)
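To make the quoted setup concrete, the sketch below shows a generic bottleneck adapter module using the reported hyperparameters (d = 768, r = 128 or 512, s = 0.1) and the described initialization (Kaiming Normal for the down-projection, zeros for the up-projection). This is a minimal PyTorch illustration, not the official UniAdapter implementation: the choice of ReLU activation, the residual placement, and the class/variable names are assumptions, and UniAdapter's cross-modal weight sharing is not reproduced here.

```python
# Minimal bottleneck-adapter sketch (illustrative only, not the released UniAdapter code).
# Assumed: ReLU activation and a residual connection around the adapter.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 128, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection to the bottleneck dim r
        self.act = nn.ReLU()                     # activation choice is an assumption
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to the model dim d
        self.scale = scale                       # scaling factor s applied to the adapter output

        # Initialization as described in the experiment setup:
        # Kaiming Normal for the down-projection, zeros for the up-projection,
        # so the adapter initially contributes nothing and the frozen backbone is preserved.
        nn.init.kaiming_normal_(self.down.weight)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen-backbone features plus a scaled bottleneck transformation.
        return x + self.scale * self.up(self.act(self.down(x)))


if __name__ == "__main__":
    adapter = BottleneckAdapter(dim=768, bottleneck=128, scale=0.1)
    tokens = torch.randn(2, 197, 768)            # e.g., a batch of ViT token sequences
    out = adapter(tokens)
    print(out.shape)                             # torch.Size([2, 197, 768])
```

With r = 128 the module has roughly 2 * 768 * 128 trainable weights plus biases per adapter, which is consistent with the parameter-efficient regime described in the setup (only adapters are tuned while the backbone stays frozen).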