Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Unveiling Encoder-Free Vision-Language Models
Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. |
| Researcher Affiliation | Collaboration | Haiwen Diao (1,2), Yufeng Cui (2), Xiaotong Li (3,2), Yueze Wang (2), Huchuan Lu (1), Xinlong Wang (2). Affiliations: (1) Dalian University of Technology; (2) Beijing Academy of Artificial Intelligence; (3) Peking University. |
| Pseudocode | No | The paper describes methods in text and uses figures to illustrate architectures, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code & Models: https://github.com/baaivision/EVE |
| Open Datasets | Yes | We train EVE using 33M publicly accessible samples from SA-1B [36], Open Images [38], and LAION [63]. |
| Dataset Splits | No | The paper mentions 'validation' as a concept in relation to its experiments but does not explicitly specify validation dataset splits (percentages or counts) or reference standard predefined splits for reproducibility beyond stating the datasets used. |
| Hardware Specification | Yes | From the above perspective, we launch EVE-7B, an encoder-free VLM evolved from Vicuna-7B [10] and trained with two 8-A100 (40G) nodes in ~9 days. |
| Software Dependencies | No | The paper mentions 'AdamW optimizer [35]', 'Vicuna-7B [10]', 'DeepSpeed stage 3' and 'PyTorch' implicitly through citations, but does not provide specific version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | The maximum learning rates for Stage 1, 2, 3 are $4 \times 10^{-4}$, $4 \times 10^{-5}$, $2 \times 10^{-5}$, while the number of batch size and training samples are 512, 512, 128 and 16M, 33M, 665K for EVE-7B |
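
For readers who want to sanity-check the reported setup, the per-stage hyperparameters and hardware figures in the table can be written down as a small configuration. The sketch below is illustrative only and is not taken from the EVE repository: the names `StageConfig`, `EVE_7B_STAGES`, and `gpu_hours` are hypothetical, the sample counts treat the rounded figures (16M, 33M, 665K) as exact, and the step counts assume one pass over the data with the stated batch size as the global batch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One training stage as reported in the table (hypothetical container)."""
    name: str
    max_lr: float      # maximum learning rate
    batch_size: int    # assumed to be the global batch size
    num_samples: int   # rounded figure from the paper, treated as exact

    def optimizer_steps(self) -> int:
        # Assumes a single pass over the data; the last batch may be partial.
        return math.ceil(self.num_samples / self.batch_size)


# Values copied from the "Experiment Setup" row for EVE-7B.
EVE_7B_STAGES = [
    StageConfig("stage1", 4e-4, 512, 16_000_000),
    StageConfig("stage2", 4e-5, 512, 33_000_000),
    StageConfig("stage3", 2e-5, 128, 665_000),
]


def gpu_hours(nodes: int, gpus_per_node: int, days: float) -> float:
    """Rough compute budget: total GPUs times wall-clock hours."""
    return nodes * gpus_per_node * days * 24


if __name__ == "__main__":
    for stage in EVE_7B_STAGES:
        print(f"{stage.name}: lr={stage.max_lr:g}, steps={stage.optimizer_steps()}")
    # "Hardware Specification" row: two 8-A100 nodes, roughly 9 days.
    print(f"approx. GPU-hours: {gpu_hours(2, 8, 9):.0f}")
```

Because the paper gives training time only as approximately 9 days, the GPU-hour figure is an order-of-magnitude estimate rather than a reported number.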