Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning

Authors: Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across various VL tasks and language models reveal that MemVP significantly reduces the training time and inference latency of the finetuned VL models and surpasses the performance of previous PEFT methods.
Researcher Affiliation | Collaboration | (1) School of Intelligence Science and Technology, Peking University; (2) Huawei Noah's Ark Lab; (3) National Key Laboratory of General Artificial Intelligence.
Pseudocode | No | The paper provides mathematical formulations and diagrams but does not include explicit pseudocode or algorithm blocks. (A hedged sketch of the method follows this table.)
Open Source Code | Yes | Code: https://github.com/JieShibo/MemVP
Open Datasets | Yes | For visual question answering, we evaluate our method on VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019); for image captioning, we evaluate on COCO Captions (Chen et al., 2015). Additionally, we use a challenging VQA task, ScienceQA (Lu et al., 2022).
Dataset Splits | No | The paper mentions reporting results on the validation sets of TVQA and How2QA (Appendix B.1) and on the test set or test-dev split for VQAv2, GQA, COCO Captions, and ScienceQA, but it does not specify how the training, validation, and test splits were constructed (e.g., percentages or sample counts) for all datasets, so the data partitioning cannot be fully reproduced.
Hardware Specification | Yes | We show the inference speed across different lengths of input and output on LLaMA-7B on a single V100. Measured on V100 GPUs. Measured on 8 A800 GPUs.
Software Dependencies | No | The paper does not specify the versions of software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We train on each dataset for 20 epochs with batch size 8 64 and report performance on the test set. The hyperparameters of all methods are summarized in the Appendix. Table 5: Hyperparameters on BART-base and T5-base (Learning Rate, Batch Size, Epoch, Structure Hyper-Parameters). Table 7: Hyperparameters on LLaMA (Learning Rate, Batch Size, Epoch, Structure Hyper-Parameters).
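
Because the paper conveys memory-space visual prompting only through equations and figures, the PyTorch snippet below is a minimal sketch of the core idea as we read it: visual features from a frozen vision encoder are projected and appended as extra key-value slots of the language model's FFN, so no visual tokens are prepended to the input sequence. This is an illustrative sketch, not the authors' implementation; the module layout, the GELU activation, and the `scale` factor are assumptions, and the official repository linked above should be treated as authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemVPFFN(nn.Module):
    """Sketch of memory-space visual prompting (assumed layout, not the official code).

    A Transformer FFN, FFN(x) = act(x @ W1) @ W2, is viewed as key-value memory.
    Projected visual features are appended as extra key/value slots, so image
    information is retrieved inside the FFN instead of being added as input tokens.
    """

    def __init__(self, d_model: int, d_ffn: int, d_visual: int, scale: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ffn)      # frozen W1 (memory keys)
        self.fc2 = nn.Linear(d_ffn, d_model)      # frozen W2 (memory values)
        self.proj = nn.Linear(d_visual, d_model)  # lightweight trainable projector
        self.scale = scale                        # scaling factor (assumed value)

    def forward(self, x: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # x:            (batch, seq_len, d_model)    text hidden states
        # visual_feats: (batch, n_patches, d_visual) features from a frozen vision encoder
        z = self.proj(visual_feats)                          # (batch, n_patches, d_model)

        h_text = F.gelu(self.fc1(x))                         # standard memory lookup
        h_vis = F.gelu(self.scale * x @ z.transpose(1, 2))   # scores against visual "keys"

        # Retrieved text memory plus retrieved visual memory (z reused as "values").
        return self.fc2(h_text) + h_vis @ z
```

In this reading, only the projector (and any scaling) would be trained while the FFN weights stay frozen, which is consistent with the reduced training cost reported in the paper; the exact placement of the activation and scaling should be verified against the released code.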