Unveiling Encoder-Free Vision-Language Models
Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. |
| Researcher Affiliation | Collaboration | Haiwen Diao (1,2), Yufeng Cui (2), Xiaotong Li (3,2), Yueze Wang (2), Huchuan Lu (1), Xinlong Wang (2). Affiliations: 1 Dalian University of Technology; 2 Beijing Academy of Artificial Intelligence; 3 Peking University. Emails: diaohw@mail.dlut.edu.cn, yfcui@baai.ac.cn, lixiaotong@stu.pku.edu.cn, yzwang@baai.ac.cn, lhchuan@dlut.edu.cn, wangxinlong@baai.ac.cn |
| Pseudocode | No | The paper describes methods in text and uses figures to illustrate architectures, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: https://github.com/baaivision/EVE |
| Open Datasets | Yes | We train EVE using 33M publicly accessible samples from SA-1B [36], Open Images [38], and LAION [63]. |
| Dataset Splits | No | The paper mentions 'validation' in relation to its experiments but does not explicitly specify validation splits (percentages or counts) or reference standard predefined splits for reproducibility; it only states which datasets were used. |
| Hardware Specification | Yes | From the above perspective, we launch EVE-7B, an encoder-free VLM evolved from Vicuna-7B [10] and trained with two 8-A100 (40G) nodes in ~9 days. |
| Software Dependencies | No | The paper mentions the AdamW optimizer [35], Vicuna-7B [10], DeepSpeed stage 3, and (implicitly, through citations) PyTorch, but does not provide version numbers for these software components or for any other libraries used. |
| Experiment Setup | Yes | The maximum learning rates for Stages 1, 2, and 3 are 4×10⁻⁴, 4×10⁻⁵, and 2×10⁻⁵, while the batch sizes are 512, 512, 128 and the training sample counts are 16M, 33M, 665K for EVE-7B. |
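
To gather the Stage 1-3 hyperparameters from the Experiment Setup row in one place, a minimal Python sketch is shown below. The learning rates, batch sizes, and sample counts are the values reported in the paper; the `StageConfig` dataclass, the stage keys, and the optimizer-step estimate are illustrative assumptions and are not taken from the official https://github.com/baaivision/EVE codebase.

```python
# Minimal sketch of the reported EVE-7B three-stage training schedule.
# Hyperparameter values come from the paper's experiment setup; all names
# here (StageConfig, EVE_7B_STAGES, stage keys) are hypothetical.
from dataclasses import dataclass


@dataclass
class StageConfig:
    max_lr: float      # peak learning rate (AdamW, per the paper)
    batch_size: int    # global batch size
    num_samples: int   # training samples consumed in this stage


EVE_7B_STAGES = {
    "stage1": StageConfig(max_lr=4e-4, batch_size=512, num_samples=16_000_000),
    "stage2": StageConfig(max_lr=4e-5, batch_size=512, num_samples=33_000_000),
    "stage3": StageConfig(max_lr=2e-5, batch_size=128, num_samples=665_000),
}

if __name__ == "__main__":
    for name, cfg in EVE_7B_STAGES.items():
        # Rough optimizer-step count implied by samples / global batch size.
        steps = cfg.num_samples // cfg.batch_size
        print(f"{name}: lr={cfg.max_lr}, batch={cfg.batch_size}, "
              f"samples={cfg.num_samples:,} (~{steps:,} steps)")
```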