Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

End-to-End Vision Tokenizer Tuning

Authors: Wenxuan Wang, Fan Zhang, Yufeng Cui, Haiwen Diao, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability.
Researcher Affiliation | Academia | ¹Institute of Automation, Chinese Academy of Sciences; ²School of Artificial Intelligence, University of Chinese Academy of Sciences; ³Beijing Academy of Artificial Intelligence; ⁴Dalian University of Technology; ⁵Tsinghua University {EMAIL,EMAIL,EMAIL}
Pseudocode | No | The paper describes its methodology in Section 3, titled 'Methodology', using descriptive text and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We use publicly accessible datasets for the main experiments in this paper. Once the blind review period is finished, we'll open-source the codes, instructions, and model checkpoints.
Open Datasets | Yes | (1) Vision-Language Pre-training & Vision Tokenizer Datasets. We adopt the pre-processing pipeline [8] to refine SA-1B [20], Open Images [22], and LAION [42], resulting in 11M, 7M, and 14M images respectively. We utilize the caption engine following [8] to produce 32M high-quality captions. (2) Supervised Fine-tuning Datasets. For understanding datasets, we extract 31.8M multi-task samples from Infinity-MM [15] and 3.5M instruction data from LLaVA-OneVision [27]... For generation datasets, we generate 14M AI-created samples with the Flux model [23] and further select 16M image-text pairs from open-source web data [12, 3], applying filters based on image resolution and aesthetic scores [24].
Dataset Splits | Yes | We validate ETT on various widely known vision-language perception benchmarks, covering task-specific evaluations (GQA [17] and TextVQA [44]), hallucination detection (POPE [29]), open-domain multimodal understanding (MME [11], MMBench [33], SEED-Bench [28], and MMVet [66]), and scientific reasoning (ScienceQA-IMG [34])... We comprehensively evaluate the text-to-image generation capabilities of our model against previous diffusion-based and autoregressive-based SOTA methods, including both multimodal specialists and generalists, on widely adopted benchmark datasets GenEval [14] and T2I-CompBench [16].
Hardware Specification | Yes | We train ETT on 8-A100 nodes using the Adam optimizer [19].
Software Dependencies | No | We train ETT on 8-A100 nodes using the Adam optimizer [19].
Experiment Setup | Yes | Implementation Details. We train ETT on 8-A100 nodes using the Adam optimizer [19]. The batch sizes for Stages 1, 2, and 3 are set to 1024, 1024, and 1024, respectively, with maximum learning rates of 4 × 10⁻⁵, 4 × 10⁻⁵, and 2 × 10⁻⁵. We apply a warm-up strategy with a 0.03 ratio and use a cosine decay scheduler across all stages. Unless otherwise specified, images are processed at a resolution of 512², and ablation studies are reported using LLaVA-mix-665K [31] at Stage 3. For all the experiments in our work, we adopt Qwen2.5-1.5B [50] as the large language model for multimodal sequence modeling. For the vision tokenizer in ETT, we employ Adam optimizer [19] with a fixed learning rate of 1 × 10⁻⁴, β1 = 0.5 and β2 = 0.9. The tokenizer is trained for 500,000 steps with a global batch size of 256 and an input resolution of 256 × 256. The adversarial loss weight λG is set to 0.1, and the entropy loss weight λE is set to 0.05. We also adopt LeCAM regularization [52] for discriminator training to improve training stability.
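
The learning-rate schedule quoted in the Experiment Setup row (a 0.03 warm-up ratio followed by cosine decay, with per-stage maximum learning rates) could be sketched roughly as follows. This is an illustration only, not the authors' code: the quoted text does not give the warm-up shape, the total number of steps per stage, or the final learning rate, so the linear warm-up, the `total_steps` argument, the `min_lr=0.0` floor, and the names `lr_at_step` and `STAGE_MAX_LR` are all assumptions.

```python
import math

# Maximum learning rates per training stage, as quoted in the table above.
STAGE_MAX_LR = {1: 4e-5, 2: 4e-5, 3: 2e-5}


def lr_at_step(step, total_steps, max_lr, warmup_ratio=0.03, min_lr=0.0):
    """Return the learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the source): the warm-up is linear,
    and the cosine decay ends at min_lr after total_steps steps.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for the demo
    for s in (0, 29, 30, 500, 999):
        print(s, lr_at_step(s, total, STAGE_MAX_LR[1]))
```

In a PyTorch training loop, a function like this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier `lr_at_step(...) / max_lr`. The tokenizer itself uses a fixed learning rate according to the quoted text, so no schedule would apply there.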