p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

Authors: Haoyuan Wu, Xinyun Zhang, Peng Xu, Peiyu Liao, Xufeng Yao, Bei Yu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on different pretrained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results validate our method's significant superiority over other PETL methods.
Researcher Affiliation | Academia | Department of Computer Science & Engineering, The Chinese University of Hong Kong; wuhyhowell@gmail.com, xyzhang21@cse.cuhk.edu.hk, byu@cse.cuhk.edu.hk
Pseudocode | No | The paper describes the proposed method using mathematical equations and diagrams, but it does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/wuhy68/p-Adapter/
Open Datasets | Yes | We test our model on VQA2.0 (Goyal et al. 2017) ... SNLI-VE (Xie et al. 2019) ... COCO Captions (Lin et al. 2014) ... TextCaps (Sidorov et al. 2020), and VizWiz Caps (Gurari et al. 2020).
Dataset Splits | Yes | We test our model on VQA2.0 (Goyal et al. 2017) with the widely-used Karpathy split (Karpathy and Fei-Fei 2015) and VizWiz VQA (Gurari et al. 2018). ... COCO Captions (Lin et al. 2014) with Karpathy split (Karpathy and Fei-Fei 2015)
Hardware Specification | Yes | Our experiments are implemented in PyTorch (Paszke et al. 2019) and conducted on 8 Nvidia 3090 GPUs.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al. 2019)" but does not give a version number for PyTorch or any other software dependency, which is needed for exact reproducibility.
Experiment Setup | Yes | We use the AdamW (Loshchilov and Hutter 2017) optimizer with a weight decay of 0.05 and apply a linear scheduler. We take random image crops of resolution 224×224 as the input of the encoder, and also apply RandAugment (Cubuk et al. 2020) during training... We train the model for five epochs for VQA and VE, and two epochs for image captioning. We sweep a wide range of learning rates over {1×10⁻⁴, 2×10⁻⁴, 5×10⁻⁴, 1×10⁻³} for PETL methods, and use 2×10⁻⁵ for full fine-tuning. (A minimal code sketch of this setup follows the table.)
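
The quoted setup maps onto standard PyTorch/torchvision components. Below is a minimal sketch, assuming AdamW, a LinearLR schedule, and torchvision's RandAugment; the model, learning rate choice, and training loop are illustrative placeholders, not the authors' p-Adapter implementation.

```python
# Minimal sketch of the quoted optimization setup (assumptions noted below).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
from torchvision import transforms

# Input pipeline: random 224x224 crops plus RandAugment, as described above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(),
    transforms.ToTensor(),
])

model = torch.nn.Linear(8, 8)  # placeholder for the pre-trained VLM with PETL modules
lr = 2e-4                      # one value from the swept range {1e-4, 2e-4, 5e-4, 1e-3}
num_epochs = 5                 # five for VQA/VE, two for image captioning

optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.05)
# "Linear scheduler" is interpreted here as linear decay to zero over training.
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0,
                     total_iters=num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch over the task-specific dataloader (omitted) ...
    scheduler.step()
```

For the full fine-tuning baseline the quoted learning rate would be 2e-5 instead of a swept value; everything specific to the p-Adapter modules themselves is outside this sketch.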