Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

UniTok: a Unified Tokenizer for Visual Generation and Understanding

Authors: Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that UniTok achieves comparable or even better performance to domain-specific tokenizers: On ImageNet evaluation, UniTok records an impressive 0.38 reconstruction FID and 78.6% zero-shot accuracy at 256 × 256 resolution;
Researcher Affiliation | Collaboration | Chuofan Ma (1,2), Yi Jiang (2), Junfeng Wu (2,3), Jihan Yang (1), Xin Yu (1), Zehuan Yuan (2), Bingyue Peng (2), Xiaojuan Qi (1). (1) The University of Hong Kong; (2) ByteDance Inc.; (3) Huazhong University of Science and Technology
Pseudocode | No | The paper only contains equations and diagrams, but no structured pseudocode or algorithm blocks are present.
Open Source Code | Yes | GitHub: https://github.com/FoundationVision/UniTok
Open Datasets | Yes | We train the tokenizer for one epoch on the public dataset DataComp-1B [9] consisting of 1.28B image-text pairs, with all images resized to 256 × 256 resolution and a global batch size of 16k. ... To provide a fair comparison with tokenizers trained on small datasets, we also train a version of UniTok on OpenImages [17] solely with reconstruction supervision. ... composed of 10M language data from DCLM [22], 30M internal MidJourney-style synthetic data, and 30M re-captioned image-text pairs from COYO [32] and Laion [41].
Dataset Splits | No | The paper mentions using DataComp-1B, OpenImages, COYO, Laion for training, and evaluating on ImageNet validation sets. While ImageNet has standard splits, the specific training/validation/test splits for the large mixed datasets like DataComp-1B, DCLM, MidJourney-style synthetic data, COYO, and Laion used for tokenizer and MLLM training are not explicitly detailed in terms of percentages or counts in the main text, nor does it refer to predefined splits for these combined datasets.
Hardware Specification | Yes | It takes roughly 50 hours with the equivalent computing power of 256 A100 GPUs to train the tokenizer.
Software Dependencies | No | The paper mentions using the Llama-2-7B base model [52] and integrating with the LlamaGen [43] framework. However, it does not explicitly list specific version numbers for key software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries.
Experiment Setup | Yes | We train the tokenizer for one epoch on the public dataset DataComp-1B [9] consisting of 1.28B image-text pairs, with all images resized to 256 × 256 resolution and a global batch size of 16k. The learning rate is set to 1e-3 for the tokenizer and 2e-4 for the discriminator. ... the learning rate is set to 5e-5 in the pretraining stage and 2e-5 in the finetuning stage.
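
The tokenizer training figures quoted above imply a rough step count and compute budget. The following is a minimal Python sketch of that arithmetic, not the authors' configuration: the class and field names are hypothetical, and reading "16k" as 16 × 1024 is an assumption (with 16,000 the step count would be 80,000 instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerTrainingSetup:
    """Hypothetical container for the figures quoted in the table above."""

    dataset_pairs: int = 1_280_000_000   # DataComp-1B: 1.28B image-text pairs
    epochs: int = 1                      # "one epoch"
    resolution: int = 256                # images resized to 256 x 256
    tokenizer_lr: float = 1e-3           # learning rate for the tokenizer
    discriminator_lr: float = 2e-4       # learning rate for the discriminator
    train_hours: float = 50.0            # "roughly 50 hours"
    num_gpus: int = 256                  # A100-equivalent GPUs
    # Assumption: "16k" is read as 16 * 1024; the source does not say.
    global_batch_size: int = 16 * 1024

    def optimizer_steps(self) -> int:
        """Approximate number of optimizer steps (floor division)."""
        return self.epochs * self.dataset_pairs // self.global_batch_size

    def gpu_hours(self) -> float:
        """Approximate A100-equivalent GPU-hours for tokenizer training."""
        return self.train_hours * self.num_gpus


if __name__ == "__main__":
    setup = TokenizerTrainingSetup()
    print("optimizer steps:", setup.optimizer_steps())
    print("GPU-hours:", setup.gpu_hours())
```

Both derived numbers are estimates, since the source gives the training time only as "roughly 50 hours" and the batch size only as "16k".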