Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com |
| Researcher Affiliation | Collaboration | Jiaming Han¹,², Hao Chen², Yang Zhao², Hanyu Wang², Qi Zhao², Ziyan Yang², Hao He¹,², Xiangyu Yue¹, Lu Jiang². ¹CUHK MMLab, ²ByteDance Seed |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described through textual explanations and diagrams (e.g., Figure 2, Figure 3). |
| Open Source Code | Yes | Code, models, and data are available at https://tar.csuhan.com |
| Open Datasets | Yes | Our training data consists of image, text and multimodal datasets. Since open-sourced datasets for image, text and image-to-text tasks are widely available [13, 53, 60], our focus is on curating high-quality data for image generation. The pipeline includes: (1) Image caption. We use Qwen2.5-VL [3] to generate rich, detailed captions for general image datasets [13, 15, 51]. (2) Synthetic image generation. We adopt FLUX [27] to generate high quality images based on real user prompts [55, 66] and image captions from Step 1, which yield diverse, prompt-aligned content. In total, we curate a dataset of 23M high-quality text-image pairs for training. TA-Tok is trained on both 100M raw web images and 100M aesthetic-filtered images from LAION-5B [53] to balance its ability on encoding general images for understanding and high-quality images for generation. |
| Dataset Splits | Yes | Our LLM is trained on a diverse mix of data types, including standard image-to-text (I→T), text-to-image (T→I) and text-only (T→T) tasks. To further bridge the gap between visual understanding and generation, we introduce two additional task type: text-image-to-image (TI→I) and image-to-image (I→I). For visual understanding, we use open-source instruction tuning datasets from LLaVA-v1.5 [38] and LLaVA-Next [39]. For training MLLMs with these representations, we sample a subset of our training data for controlled experiments: 10M T2I data, 10M I2T data and 5M text-only data. To enable Scale-Adaptive Pooling and Decoding, we random select a scale from {1, 2, 3}, resulting in {729, 169, 81} tokens. Since learning longer sequence is usually harder, we set the sampling ratio of scale {1, 2, 3} to (2 : 1 : 1). The training of De-Tokenizers and MLLM also follows the same sampling ratio. |
| Hardware Specification | No | The paper mentions that "We provide sufficient information on the computational cost in the appendix (Sec. E)" in the NeurIPS checklist, but Appendix E (Additional Ablation Experiments) details experimental results without specifying hardware components like specific GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions several models and architectures like "siglip2-so400m-patch14-384 [64]", "LLaMA architecture [63] implemented in Llamagen [56]", "pretrained SANA-0.6B [74]", and "Qwen2.5-Instruct [78]". However, it does not explicitly list specific software dependencies such as programming languages (e.g., Python), frameworks (e.g., PyTorch), or libraries with their version numbers. |
| Experiment Setup | Yes | We list the training hyper-parameters in Tab. 8. Table 8: Training Parameters. Columns, in order: TA-Tok; AR-DTok (256px, 512px, 1024px); Dif-DTok (512px); MLLM (Pretrain, SFT). Rows, with values in extracted order: learning rate: 2e-4, 4e-4, 1e-4, 1e-4, 1e-4, 5e-5; lr schedule: cosine, cosine, constant, cosine; optimizer: AdamW, AdamW, CAME, AdamW; optimizer params: (β1=0.9, β2=0.99), (β1=0.9, β2=0.95), (β1=0.9, β2=0.999, β3=0.9999), (β1=0.9, β2=0.999); weight decay: 1e-4, 0.05, 0.0, 0.0; input resolution: 384, 256, 512, 1024, 512, 384; warmup epochs: 0.04, 0.04, 0.01, 0.03; epochs: 1, 1, 1, 1; total samples: 200M, 50M, 23M, 3M, 23M, 100M, 4M; total batch size: 512, 768, 96, 48, 48, 1024, 256; codebook loss: 1.0; reconstruction loss: 1.0; gradient clip: 1.0, 1.0, 0.1, 1.0; token drop prob: 0.1, 0.1; data ratio: 2 (und) : 2 (gen) : 1 (text). |
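
The token counts and sampling ratio quoted in the Dataset Splits row can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code. It assumes a 384px input split into 14px patches (read off the encoder name `siglip2-so400m-patch14-384` in the Software Dependencies row) and assumes that pooling at scale `s` floor-divides the side of the patch grid by `s`. The function names `tokens_per_image`, `sample_scale`, and `expected_tokens` are hypothetical.

```python
import random

# Assumptions (not stated in this form in the source): resolution and patch
# size are read off the encoder name "siglip2-so400m-patch14-384".
INPUT_RESOLUTION = 384
PATCH_SIZE = 14

# Quoted in the Dataset Splits row: scales {1, 2, 3} sampled at ratio 2 : 1 : 1.
SCALES = (1, 2, 3)
SCALE_WEIGHTS = (2, 1, 1)


def tokens_per_image(scale, resolution=INPUT_RESOLUTION, patch=PATCH_SIZE):
    """Token count after pooling the patch grid by `scale`.

    Floor division is an assumption; it is the choice that reproduces the
    quoted counts.
    """
    side = (resolution // patch) // scale
    return side * side


def sample_scale(rng):
    """Draw one scale according to the 2 : 1 : 1 sampling ratio."""
    return rng.choices(SCALES, weights=SCALE_WEIGHTS, k=1)[0]


def expected_tokens():
    """Average tokens per image under the 2 : 1 : 1 ratio."""
    total_weight = sum(SCALE_WEIGHTS)
    weighted = sum(w * tokens_per_image(s) for s, w in zip(SCALES, SCALE_WEIGHTS))
    return weighted / total_weight


if __name__ == "__main__":
    for s in SCALES:
        print(f"scale {s}: {tokens_per_image(s)} tokens")
    print("expected tokens per image:", expected_tokens())
    rng = random.Random(0)
    print("ten sampled scales:", [sample_scale(rng) for _ in range(10)])
```

Under these assumptions the three scales give 729, 169, and 81 tokens, matching the quoted set, and the 2 : 1 : 1 ratio gives an average of 427 tokens per image.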