Unveiling Encoder-Free Vision-Language Models

Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. (See the encoder-free forward-pass sketch after this table.)
Researcher Affiliation | Collaboration | Haiwen Diao (Dalian University of Technology; Beijing Academy of Artificial Intelligence), Yufeng Cui (Beijing Academy of Artificial Intelligence), Xiaotong Li (Peking University; Beijing Academy of Artificial Intelligence), Yueze Wang (Beijing Academy of Artificial Intelligence), Huchuan Lu (Dalian University of Technology), Xinlong Wang (Beijing Academy of Artificial Intelligence). Contact: diaohw@mail.dlut.edu.cn, yfcui@baai.ac.cn, lixiaotong@stu.pku.edu.cn, yzwang@baai.ac.cn, lhchuan@dlut.edu.cn, wangxinlong@baai.ac.cn
Pseudocode | No | The paper describes methods in text and uses figures to illustrate architectures, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Models: https://github.com/baaivision/EVE
Open Datasets | Yes | We train EVE using 33M publicly accessible samples from SA-1B [36], Open Images [38], and LAION [63].
Dataset Splits | No | The paper refers to validation in the context of its experiments, but it does not explicitly specify dataset splits (percentages or counts) or reference standard predefined splits; it only states which datasets were used.
Hardware Specification | Yes | From the above perspective, we launch EVE-7B, an encoder-free VLM evolved from Vicuna-7B [10] and trained with two 8-A100 (40G) nodes in ~9 days.
Software Dependencies | No | The paper mentions the AdamW optimizer [35], Vicuna-7B [10], DeepSpeed stage 3, and PyTorch (implicitly, through citations), but it does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | The maximum learning rates for Stages 1, 2, and 3 are 4×10⁻⁴, 4×10⁻⁵, and 2×10⁻⁵, while the batch sizes and numbers of training samples are 512, 512, 128 and 16M, 33M, 665K for EVE-7B. (A configuration sketch of this three-stage schedule follows the table.)
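
The Research Type row above quotes the paper's core idea of bridging vision and language inside one unified decoder. To make "encoder-free" concrete, the sketch below feeds projected image patches and text embeddings into a single decoder-only LLM, with no pretrained vision encoder in front. This is a minimal sketch under stated assumptions: the `PatchEmbed` layer, hidden size, and concatenation scheme are illustrative and not the paper's exact EVE implementation, which additionally uses extra supervision to strengthen visual recognition.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch embedding (an assumption, not EVE's exact layer):
    splits the image into patches and projects each one into the LLM's
    hidden space so the decoder can consume vision tokens directly."""
    def __init__(self, patch_size=14, in_chans=3, hidden_size=4096):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, 3, H, W)
        x = self.proj(images)                      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)        # (B, num_patches, D)

class EncoderFreeVLM(nn.Module):
    """Minimal encoder-free VLM sketch: one unified decoder processes the
    concatenation of projected image patches and text token embeddings."""
    def __init__(self, llm_decoder, embed_tokens, hidden_size=4096):
        super().__init__()
        self.patch_embed = PatchEmbed(hidden_size=hidden_size)
        self.embed_tokens = embed_tokens           # the LLM's token embedding
        self.decoder = llm_decoder                 # decoder-only LLM blocks

    def forward(self, images, input_ids):
        vision_tokens = self.patch_embed(images)           # (B, Nv, D)
        text_tokens = self.embed_tokens(input_ids)         # (B, Nt, D)
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)
        # Assumes a decoder that accepts precomputed embeddings,
        # e.g. a Hugging Face causal LM called with `inputs_embeds`.
        return self.decoder(inputs_embeds=sequence)
```

The contrast with encoder-based VLMs (for example, a frozen pretrained CLIP encoder plus a projector in front of the LLM) is that here only a lightweight embedding layer sits before the unified decoder, which is why the paper stresses extra supervision to recover visual recognition capability.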
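
The Experiment Setup row quotes the per-stage maximum learning rates, batch sizes, and training-sample counts for EVE-7B. The snippet below simply collects those numbers and shows how one might instantiate a matching optimizer; only the numeric values come from the quoted text, while the dictionary layout, the cosine schedule, and the helper function are illustrative assumptions (the paper mentions the AdamW optimizer and DeepSpeed stage 3 but, as noted under Software Dependencies, no versions).

```python
import torch

# Per-stage hyperparameters quoted for EVE-7B; everything else
# (scheduler type, warmup, optimizer betas) is an illustrative assumption.
EVE_7B_STAGES = {
    "stage1": {"max_lr": 4e-4, "batch_size": 512, "train_samples": 16_000_000},
    "stage2": {"max_lr": 4e-5, "batch_size": 512, "train_samples": 33_000_000},
    "stage3": {"max_lr": 2e-5, "batch_size": 128, "train_samples": 665_000},
}

def make_optimizer_and_scheduler(model, stage):
    """Hypothetical helper: AdamW (mentioned in the paper) with a cosine
    decay (assumption) over the stage's approximate number of optimizer steps."""
    cfg = EVE_7B_STAGES[stage]
    steps = cfg["train_samples"] // cfg["batch_size"]
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["max_lr"])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
    return optimizer, scheduler
```

With global batch sizes of 512, 512, and 128, the quoted sample counts correspond to roughly 31K, 64K, and 5K optimizer steps per stage, which is the `T_max` the sketch derives.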