VPGTrans: Transfer Visual Prompt Generator across LLMs

Authors: Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Via extensive experiments on the transfer across LLM sizes and types (cf. Sections 4 and 5), we gain the following key observations:
Researcher Affiliation | Academia | Ao Zhang (1), Hao Fei (1), Yuan Yao (2), Wei Ji (1), Li Li (1), Zhiyuan Liu (2), Tat-Seng Chua (1); (1) NExT++ Lab, School of Computing, National University of Singapore; (2) Department of Computer Science and Technology, Tsinghua University
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | All code and models are released at https://github.com/VPGTrans/VPGTrans.
Open Datasets | Yes | For all of the exploration experiments, we adopt the human-annotated COCO caption dataset [34] and the web image-text SBU dataset [40], which results in 1.4 million image-text pairs.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility, although it mentions evaluating on common datasets and includes 'val' in table headers.
Hardware Specification | Yes | For example, training a BLIP-2 FlanT5-XXL needs over 600 A100-GPU hours on over 100 million image-text pairs. The word converter training only requires updating a linear layer on tokenized text data and typically takes less than 10 minutes on 1 A100 GPU with less than 15 GB of GPU memory. (A sketch of such a linear word converter appears after this table.)
Software Dependencies | No | The paper mentions using FP16 and BFloat16 and following BLIP-2's open-source code, but does not provide specific software names with version numbers (e.g., Python or PyTorch versions).
Experiment Setup | Yes | For the learning rate, we first conduct a linear warm-up from 1e-6 to 1e-4, and then use a cosine learning rate schedule with a minimum lr of 1e-5 for 10 epochs. Specifically, we set the batch sizes to 1,728 and 1,152 for the OPT- and Flan-T5-based models, respectively. (A sketch of this schedule appears below.)
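
The learning-rate schedule quoted in the Experiment Setup row is easy to reconstruct. Below is a minimal sketch in plain Python: only the endpoint values (warm-up from 1e-6 to 1e-4, cosine decay to a floor of 1e-5) come from the paper; the step counts and the function name lr_at_step are illustrative assumptions, not values from the paper or its code.

```python
import math

def lr_at_step(step, total_steps, warmup_steps,
               warmup_start=1e-6, peak=1e-4, floor=1e-5):
    """Linear warm-up from `warmup_start` to `peak`, then cosine decay to `floor`.

    Mirrors the schedule quoted above (1e-6 -> 1e-4 warm-up, cosine to 1e-5);
    `total_steps` and `warmup_steps` are assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Linear interpolation during the warm-up phase.
        return warmup_start + (peak - warmup_start) * step / max(1, warmup_steps)
    # Cosine decay from `peak` down to `floor` over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# Example: print the learning rate at a few points of a run.
if __name__ == "__main__":
    total, warmup = 10_000, 1_000  # illustrative step counts
    for s in (0, 500, 1_000, 5_000, 10_000):
        print(s, f"{lr_at_step(s, total, warmup):.2e}")
```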
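
The Hardware Specification row notes that word converter training only updates a single linear layer on tokenized text data. The snippet below is a rough, non-authoritative sketch of what such a training step could look like: the embedding sizes, vocabulary size, optimizer, MSE objective, and the assumption that the source and target LLMs share a tokenizer are all illustrative, not taken from the paper or the released code.

```python
import torch
import torch.nn as nn

# Hypothetical embedding tables for a source and a target LLM assumed to share a
# tokenizer; all sizes here are illustrative only.
vocab_size, d_src, d_tgt = 50_000, 2048, 4096
src_emb = nn.Embedding(vocab_size, d_src)   # stand-in for the source LLM's word embeddings
tgt_emb = nn.Embedding(vocab_size, d_tgt)   # stand-in for the target LLM's word embeddings

converter = nn.Linear(d_src, d_tgt)         # the single linear layer being trained
optimizer = torch.optim.AdamW(converter.parameters(), lr=1e-4)

def train_step(token_ids: torch.Tensor) -> float:
    """One converter update on a batch of token ids from tokenized text."""
    with torch.no_grad():
        x = src_emb(token_ids)               # frozen source-side embeddings
        y = tgt_emb(token_ids)               # frozen target-side embeddings
    loss = nn.functional.mse_loss(converter(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a random batch of token ids (batch of 32 sequences, length 64).
print(train_step(torch.randint(0, vocab_size, (32, 64))))
```

Because only the converter's weights receive gradients, the per-step cost is tiny, which is consistent with the quoted figure of under 10 minutes on a single A100 with under 15 GB of memory.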