Unveiling Encoder-Free Vision-Language Models

Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. (See the encoder-free forward-pass sketch after this table.)
Researcher Affiliation | Collaboration | Haiwen Diao (Dalian University of Technology; Beijing Academy of Artificial Intelligence), Yufeng Cui (Beijing Academy of Artificial Intelligence), Xiaotong Li (Peking University; Beijing Academy of Artificial Intelligence), Yueze Wang (Beijing Academy of Artificial Intelligence), Huchuan Lu (Dalian University of Technology), Xinlong Wang (Beijing Academy of Artificial Intelligence). Contact: diaohw@mail.dlut.edu.cn, yfcui@baai.ac.cn, lixiaotong@stu.pku.edu.cn, yzwang@baai.ac.cn, lhchuan@dlut.edu.cn, wangxinlong@baai.ac.cn
Pseudocode | No | The paper describes methods in text and uses figures to illustrate architectures, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Models: https://github.com/baaivision/EVE
Open Datasets | Yes | We train EVE using 33M publicly accessible samples from SA-1B [36], Open Images [38], and LAION [63].
Dataset Splits | No | The paper refers to validation in the context of its experiments, but it does not explicitly specify dataset splits (percentages or counts) or reference standard predefined splits; it only states which datasets were used.
Hardware Specification | Yes | From the above perspective, we launch EVE-7B, an encoder-free VLM evolved from Vicuna-7B [10] and trained with two 8-A100 (40G) nodes in ~9 days.
Software Dependencies | No | The paper mentions the AdamW optimizer [35], Vicuna-7B [10], DeepSpeed stage 3, and PyTorch (implicitly, through citations), but it does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | The maximum learning rates for Stages 1, 2, and 3 are 4×10⁻⁴, 4×10⁻⁵, and 2×10⁻⁵, while the batch sizes and numbers of training samples are 512, 512, 128 and 16M, 33M, 665K for EVE-7B. (A configuration sketch of this three-stage schedule follows the table.)
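
The Research Type row above quotes the paper's core idea of bridging vision and language inside one unified decoder. To make "encoder-free" concrete, the sketch below feeds projected image patches and text embeddings into a single decoder-only LLM, with no pretrained vision encoder in front. This is a minimal sketch under stated assumptions: the `PatchEmbed` layer, hidden size, and concatenation scheme are illustrative and not the paper's exact EVE implementation, which additionally uses extra supervision to strengthen visual recognition.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch embedding (an assumption, not EVE's exact layer):
    splits the image into patches and projects each one into the LLM's
    hidden space so the decoder can consume vision tokens directly."""
    def __init__(self, patch_size=14, in_chans=3, hidden_size=4096):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, 3, H, W)
        x = self.proj(images)                      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)        # (B, num_patches, D)

class EncoderFreeVLM(nn.Module):
    """Minimal encoder-free VLM sketch: one unified decoder processes the
    concatenation of projected image patches and text token embeddings."""
    def __init__(self, llm_decoder, embed_tokens, hidden_size=4096):
        super().__init__()
        self.patch_embed = PatchEmbed(hidden_size=hidden_size)
        self.embed_tokens = embed_tokens           # the LLM's token embedding
        self.decoder = llm_decoder                 # decoder-only LLM blocks

    def forward(self, images, input_ids):
        vision_tokens = self.patch_embed(images)           # (B, Nv, D)
        text_tokens = self.embed_tokens(input_ids)         # (B, Nt, D)
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)
        # Assumes a decoder that accepts precomputed embeddings,
        # e.g. a Hugging Face causal LM called with `inputs_embeds`.
        return self.decoder(inputs_embeds=sequence)
```

The contrast with encoder-based VLMs (for example, a frozen pretrained CLIP encoder plus a projector in front of the LLM) is that here only a lightweight embedding layer sits before the unified decoder, which is why the paper stresses extra supervision to recover visual recognition capability.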
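
The Experiment Setup row quotes the per-stage maximum learning rates, batch sizes, and training-sample counts for EVE-7B. The snippet below simply collects those numbers and shows how one might instantiate a matching optimizer; only the numeric values come from the quoted text, while the dictionary layout, the cosine schedule, and the helper function are illustrative assumptions (the paper mentions the AdamW optimizer and DeepSpeed stage 3 but, as noted under Software Dependencies, no versions).

```python
import torch

# Per-stage hyperparameters quoted for EVE-7B; everything else
# (scheduler type, warmup, optimizer betas) is an illustrative assumption.
EVE_7B_STAGES = {
    "stage1": {"max_lr": 4e-4, "batch_size": 512, "train_samples": 16_000_000},
    "stage2": {"max_lr": 4e-5, "batch_size": 512, "train_samples": 33_000_000},
    "stage3": {"max_lr": 2e-5, "batch_size": 128, "train_samples": 665_000},
}

def make_optimizer_and_scheduler(model, stage):
    """Hypothetical helper: AdamW (mentioned in the paper) with a cosine
    decay (assumption) over the stage's approximate number of optimizer steps."""
    cfg = EVE_7B_STAGES[stage]
    steps = cfg["train_samples"] // cfg["batch_size"]
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["max_lr"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
    return optimizer, scheduler
```

With global batch sizes of 512, 512, and 128, the quoted sample counts correspond to roughly 31K, 64K, and 5K optimizer steps per stage, which is the `T_max` the sketch derives.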