Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models

Authors: Xu Yang, Yingzhe Peng, Haoxuan Ma, Shuo Xu, Chi Zhang, Yucheng Han, Hanwang Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that these ICD sequences can improve the ICL performance of two LVLMs compared with some strong baselines in Visual Question Answering and Image Captioning, validating that Lever-LM can really capture the statistical patterns for levering LVLMs."
Researcher Affiliation | Academia | "Xu Yang (1,2), Yingzhe Peng (1,2), Haoxuan Ma (1,2), Shuo Xu (1,2), Chi Zhang (3), Yucheng Han (4), Hanwang Zhang (4). 1: Southeast University; 2: Key Laboratory of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Southeast University), Ministry of Education; 3: Westlake University; 4: Nanyang Technological University"
Pseudocode | No | The paper describes the architecture and process in text and diagrams (Figure 2), but does not provide structured pseudocode or an algorithm block.
Open Source Code | Yes | "The code is available at https://github.com/ForJadeForest/Lever-LM."
Open Datasets | Yes | "Our approach is evaluated on MS-COCO [56] for Image Captioning (IC) and VQAv2 [60] for Visual Question Answering (VQA). For each corresponding dataset, we use the train split to construct the DM and use the validation split to evaluate the performance of ICD configurations generated by Lever-LM. More details are given in Appendix A."
Dataset Splits | Yes | "Our approach is evaluated on MS-COCO [56] for Image Captioning (IC) and VQAv2 [60] for Visual Question Answering (VQA). For each corresponding dataset, we use the train split to construct the DM and use the validation split to evaluate the performance of ICD configurations generated by Lever-LM. More details are given in Appendix A." (A hedged data-loading sketch appears after the table.)
Hardware Specification | Yes | "All experiments are deployed on an RTX 3090. All training processes are carried out with mixed precision and 2 RTX 3090 GPUs."
Software Dependencies | No | The paper mentions software such as the AdamW optimizer, OpenFlamingo, IDEFICS, and CLIP, but does not specify version numbers (e.g., PyTorch version, specific library versions).
Experiment Setup | Yes | "The training phase leverages the AdamW optimizer [61] and a cosine learning rate scheduler. We set the learning rate to 1 × 10^-4 and the batch size to 128. We train our Lever-LM for 20 epochs. To implement ICL, we use OpenFlamingo V2-9B [34] and IDEFICS-9B [14] as our LVLMs." (A hedged sketch of this training recipe appears below.)
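
The Open Datasets and Dataset Splits rows describe the data configuration only at a high level: the train split builds the demonstration pool (DM), and the validation split is used for evaluation. The following minimal sketch shows one way such splits could be loaded; it is not the authors' pipeline, and every path, split year, and file name below is an assumption (the paper defers specifics to Appendix A).

```python
# Hypothetical data-loading sketch; paths, split years, and file names are
# assumptions, not taken from the paper or its repository.
import json

from torchvision.datasets import CocoCaptions  # needs pycocotools installed

# MS-COCO for Image Captioning: the train split builds the demonstration
# pool (DM); the validation split is used for evaluation.
coco_train = CocoCaptions(
    root="data/coco/train2017",
    annFile="data/coco/annotations/captions_train2017.json",
)
coco_val = CocoCaptions(
    root="data/coco/val2017",
    annFile="data/coco/annotations/captions_val2017.json",
)

# VQAv2 ships as JSON question/annotation files (names follow the official
# release, but are still assumptions here).
with open("data/vqav2/v2_OpenEnded_mscoco_train2014_questions.json") as f:
    vqa_train_questions = json.load(f)["questions"]
with open("data/vqav2/v2_mscoco_train2014_annotations.json") as f:
    vqa_train_annotations = json.load(f)["annotations"]

image, captions = coco_train[0]  # PIL image and its list of reference captions
print(len(coco_train), len(coco_val), len(vqa_train_questions))
```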
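
The Hardware Specification and Experiment Setup rows together pin down a standard PyTorch recipe: AdamW, a cosine schedule, learning rate 1 × 10^-4, batch size 128, 20 epochs, and mixed precision on two RTX 3090s. Below is a minimal runnable sketch of that recipe, assuming torch.cuda.amp for mixed precision; the linear model and random tensors are throwaway placeholders, since the actual Lever-LM architecture lives in the linked repository.

```python
# Minimal sketch of the reported training recipe. The model and data are
# placeholders; only the optimizer/scheduler/precision settings mirror the
# paper's stated setup.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(512, 512).cuda()          # placeholder for Lever-LM
# model = nn.DataParallel(model)            # one way to span two RTX 3090s
train_set = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))
loader = DataLoader(train_set, batch_size=128, shuffle=True)  # batch size 128

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)    # lr = 1 x 10^-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
scaler = torch.cuda.amp.GradScaler()        # loss scaling for mixed precision

for epoch in range(20):                     # 20 epochs
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():     # forward pass in reduced precision
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()       # scaled backward pass
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                        # cosine decay once per epoch
```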