VPGTrans: Transfer Visual Prompt Generator across LLMs
Authors: Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments on the transfer across LLM sizes and types (cf. Sections 4 and 5), we gain the following key observations: |
| Researcher Affiliation | Academia | Ao Zhang¹, Hao Fei¹, Yuan Yao², Wei Ji¹, Li Li¹, Zhiyuan Liu², Tat-Seng Chua¹; ¹NExT++ Lab, School of Computing, National University of Singapore; ²Department of Computer Science and Technology, Tsinghua University |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | All codes and models are released at https://github.com/VPGTrans/VPGTrans. |
| Open Datasets | Yes | For all of the exploration experiments, we adopt the human-annotated COCO caption dataset [34] and the web image-text pair SBU dataset [40], which results in 1.4 million image-text pairs. |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility, although it mentions evaluating on common datasets and includes 'val' in table headers. |
| Hardware Specification | Yes | For example, training a BLIP-2 Flan-T5XXL needs over 600 A100-GPU hours on over 100 million image-text pairs. The word converter training only requires updating a linear layer on tokenized text data and typically takes less than 10 minutes on 1 A100 GPU with less than 15 GB of GPU memory. |
| Software Dependencies | No | The paper mentions using FP16 and BFloat16, and following BLIP-2's open code, but does not provide specific software names with version numbers (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For the learning rate, we first conduct a linear warm-up from 1e-6 to 1e-4, and then use a cosine learning rate schedule with the minimal lr=1e-5 for 10 epochs. Specifically, we set the batch size to 1,728 and 1,152 for OPT- and Flan-T5-based models, respectively. A minimal sketch of this learning-rate schedule is shown below the table. |
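The quoted setup describes a linear warm-up from 1e-6 to 1e-4 followed by cosine decay down to a minimum of 1e-5. The snippet below is a minimal PyTorch sketch of such a schedule, not the authors' released code: the warm-up and total step counts (`warmup_steps`, `total_steps`), the stand-in `Linear` model, and the placeholder loss are assumptions introduced for illustration, since the paper excerpt does not specify them.

```python
import math
import torch

# Values taken from the quoted experiment setup.
WARMUP_START_LR = 1e-6
PEAK_LR = 1e-4
MIN_LR = 1e-5

# Assumptions: the excerpt does not state step counts, only "10 epochs".
warmup_steps = 2_000
total_steps = 50_000

# Stand-in module; in VPGTrans this would be the VPG / projector parameters.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Multiplier on PEAK_LR: linear warm-up, then cosine decay to MIN_LR."""
    if step < warmup_steps:
        lr = WARMUP_START_LR + (PEAK_LR - WARMUP_START_LR) * step / warmup_steps
    else:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
    return lr / PEAK_LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Demonstration: a few optimizer steps on a placeholder loss, printing the scheduled lr.
dummy_input = torch.randn(4, 768)
for step in range(5):
    loss = model(dummy_input).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(f"step {step}: lr = {scheduler.get_last_lr()[0]:.2e}")
```

In an actual run, `scheduler.step()` would be called once per training step over all 10 epochs so that the learning rate reaches the 1e-5 floor at the end of training.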