Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation

Authors: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate that VFMTok, by combining the representational power of visual foundation models with a novel region-adaptive tokenization strategy based on irregular sampling and learnable anchor queries, enables both high-quality and efficient image reconstruction and autoregressive (AR) generation. First, VFMTok achieves superior reconstruction quality and captures richer semantics using significantly fewer tokens compared to prior methods (e.g., 256 vs. 576 in [42]), resulting in a structured, semantic-aware, and compact latent space. As shown in Tab. 1, VFMTok, with only 256 tokens, outperforms other tokenizers using the same VFM encoder by delivering superior reconstruction quality and stronger semantic representation (as indicated by linear probing). Second, the high-quality latent space produced by VFMTok facilitates effective AR training using a simple LLaMA-based model, leading to faster convergence (see Fig. 1(b)) and improved generation performance. Notably, the 1.4B AR model surpasses the performance of LlamaGen-3B despite having fewer parameters and requiring fewer training iterations. The 1.5B advanced AR model achieves a new state-of-the-art with a gFID of 1.36 on ImageNet [10] 256×256, outperforming widely-used diffusion models. Third, due to the compact token space and the reduced number of tokens, VFMTok significantly improves the inference speed of AR models (see Tab. 1).
Researcher Affiliation | Collaboration | Anlin Zheng (1), Xin Wen (1), Xuanyang Zhang (2), Chuofan Ma (1), Tiancai Wang (3), Gang Yu (2), Xiangyu Zhang (2, 4), Xiaojuan Qi (1). Affiliations: (1) The University of Hong Kong; (2) StepFun; (3) Dexmal; (4) MEGVII Technology.
Pseudocode | No | The paper describes the methods and procedures in narrative text and figures (Figure 1 and Figure 2 show architectural diagrams) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/CVMI-Lab/VFMTok.
Open Datasets | Yes | The model is trained on the ImageNet [10] training set and evaluated on its validation set.
Dataset Splits | Yes | The model is trained on the ImageNet [10] training set and evaluated on its validation set.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications. It mentions computational costs but without concrete hardware information.
Software Dependencies | No | The paper mentions software components like "Transformer" and "Llama-based AR model" but does not specify version numbers for these or any other libraries or frameworks used in the implementation.
Experiment Setup | Yes | Consistent with [42, 55], we set the codebook vector dimension of the quantizer to 12 with a codebook size of 16384, to achieve a better reconstruction quality and efficient codebook utilization. Meanwhile, VFMTok utilizes 256 tokens to represent an image. Besides, the depth of the Transformer is set to 6 (following [62]). The model is trained on the ImageNet [10] training set and evaluated on its validation set. Models with fewer than 1B parameters are trained for 300 epochs, while the remaining models are trained for 200 epochs. In our experiments, we set both α and λ to 1.
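
As a rough illustration of the numbers quoted in the Experiment Setup and Research Type rows, the following Python sketch collects them in one place and works out two derived quantities: the size of the discrete code for one image and the token count relative to the 576-token baseline cited as [42]. The class name `TokenizerSetup`, its field names, the `training_epochs` helper, and the 500M example model size are hypothetical and are not taken from the paper or its repository. The bits-per-image figure assumes each token is a single index into the 16384-entry codebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    codebook_dim: int = 12       # codebook vector dimension of the quantizer
    codebook_size: int = 16384   # number of codebook entries
    num_tokens: int = 256        # tokens used to represent one image
    transformer_depth: int = 6   # depth of the Transformer
    alpha: float = 1.0           # loss weight alpha, as quoted
    lam: float = 1.0             # loss weight lambda, as quoted

    def bits_per_token(self) -> int:
        # Bits needed to index one codebook entry (assumes one index per token).
        return (self.codebook_size - 1).bit_length()

    def bits_per_image(self) -> int:
        # Total size of the discrete code for one image under that assumption.
        return self.num_tokens * self.bits_per_token()


def training_epochs(num_params: int) -> int:
    """Quoted schedule: fewer than 1B parameters gives 300 epochs, otherwise 200."""
    return 300 if num_params < 1_000_000_000 else 200


setup = TokenizerSetup()

# Token count relative to the 576-token tokenizer mentioned in the Research Type row.
token_ratio = setup.num_tokens / 576

print("bits per token:", setup.bits_per_token())
print("bits per image:", setup.bits_per_image())
print("token ratio vs. 576:", round(token_ratio, 3))
print("epochs for a 1.4B model:", training_epochs(1_400_000_000))
print("epochs for a hypothetical 500M model:", training_epochs(500_000_000))
```

Under these assumptions, one image is described by 256 indices of 14 bits each, and the sequence seen by the AR model is a little under half the length of a 576-token sequence.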