Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
Authors: Xu Yang, Yingzhe Peng, Haoxuan Ma, Shuo Xu, Chi Zhang, Yucheng Han, Hanwang Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that these ICD sequences can improve the ICL performance of two LVLMs compared with some strong baselines in Visual Question Answering and Image Captioning, validating that Lever-LM can really capture the statistical patterns for levering LVLMs. |
| Researcher Affiliation | Academia | Xu Yang^{1,2}, Yingzhe Peng^{1,2}, Haoxuan Ma^{1,2}, Shuo Xu^{1,2}, Chi Zhang^{3}, Yucheng Han^{4}, Hanwang Zhang^{4} — 1 Southeast University; 2 Key Laboratory of New Generation Artificial Intelligence Technology & Its Interdisciplinary Applications (Southeast University), Ministry of Education; 3 Westlake University; 4 Nanyang Technological University |
| Pseudocode | No | The paper describes the architecture and process in text and diagrams (Figure 2), but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/ForJadeForest/Lever-LM. |
| Open Datasets | Yes | Our approach is evaluated on MS-COCO [56] for Image Captioning (IC) and VQAV2 [60] for Visual Question Answering (VQA). For each corresponding dataset, we use the train split to construct the DM and use the validation split to evaluate the performance of ICD configurations generated by Lever-LM. More details are given in Appendix A. |
| Dataset Splits | Yes | Our approach is evaluated on MS-COCO [56] for Image Captioning (IC) and VQAV2 [60] for Visual Question Answering (VQA). For each corresponding dataset, we use the train split to construct the DM and use the validation split to evaluate the performance of ICD configurations generated by Lever-LM. More details are given in Appendix A. |
| Hardware Specification | Yes | All experiments are deployed on an RTX 3090. All training processes are carried out with mixed precision and 2 RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like the AdamW optimizer, OpenFlamingo, IDEFICS, and CLIP, but does not specify their version numbers (e.g., PyTorch version, specific library versions). |
| Experiment Setup | Yes | The training phase leverages the AdamW optimizer [61] and a cosine learning rate scheduler. We set the learning rate to 1 × 10⁻⁴ and the batch size to 128. We train our Lever-LM for 20 epochs. To implement ICL, we use OpenFlamingo V2-9B [34] and IDEFICS-9B [14] as our LVLMs. |
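For reproduction, the reported schedule (AdamW, cosine learning rate decay, base learning rate 1 × 10⁻⁴, 20 epochs) can be sketched as a plain cosine-annealing function. This is a minimal illustration, not the authors' code: the minimum learning rate of 0 and per-epoch (rather than per-step) annealing are assumptions not stated in the paper.

```python
import math

def cosine_lr(epoch: int, total_epochs: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate.

    base_lr=1e-4 and total_epochs=20 follow the paper; min_lr=0.0 and
    epoch-level granularity are assumptions for illustration.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate at each of the 20 training epochs,
# decaying from 1e-4 toward 0 along a half cosine.
schedule = [cosine_lr(e, 20) for e in range(20)]
```

In a PyTorch reproduction this corresponds to `torch.optim.AdamW` combined with `torch.optim.lr_scheduler.CosineAnnealingLR`; the sketch above only makes the decay curve explicit.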