Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
Authors: Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. |
| Researcher Affiliation | Collaboration | Jin Wang,1 Yao Lai,1 Aoxue Li,2 Shifeng Zhang,2 Jiacheng Sun,2 Ning Kang,2 Chengyue Wu,1 Zhenguo Li,2 Ping Luo,1; 1 The University of Hong Kong, 2 Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes processes and equations, such as Definition 1 and Eq. 1-5, and outlines the training and inference steps, but it does not present a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | We will release the code and model weight when the paper is accepted. |
| Open Datasets | Yes | In both training stages, we use approximately 13M supervised finetuning data to learn our FUDOKI, including 9M in-house generation data for text-to-image generation and 4M public understanding data, which covers various aspects including OCR [61, 62], doc [63], chart [64], screen [65], math [66, 67], language [68], etc. This is less than Chameleon's 1.4B data [54] and LWM's 1B data [69]. We leave the detailed dataset collections in the appendix. For text generation, the sequence length for the response is set to 500, while for image generation, it is set to 576 to match the input size of the image encoder. |
| Dataset Splits | No | The paper states: "Our training set comprises a total of 12.62 million samples, divided into two main categories: Generation (8.76M, 69%) and Understanding (3.86M, 31%)". It also mentions evaluating on several benchmarks. However, it does not explicitly provide details about training, validation, and test splits for the 12.62 million samples, nor does it specify the splits used for the listed benchmarks beyond general evaluation. |
| Hardware Specification | No | The entire training process spanned approximately 43,000 GPU hours. |
| Software Dependencies | No | The paper mentions using a 'tokenizer with a vocabulary size of 102,400', 'LlamaGen [60]', and 'SigLIP [59]', but it does not specify concrete version numbers for these or other software libraries/dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For text generation, the sequence length for the response is set to 500, while for image generation, it is set to 576 to match the input size of the image encoder. The text embeddings for calculating the metric distance function $d(\cdot, \cdot)$ are taken from the original embedding layer of Janus-Pro-7B [26] and the image embeddings are obtained from the codebook of LlamaGen [60]. We set $\beta_t = c \left(\frac{t}{1-t}\right)^{a}$ with $c = 3$ and $a = 0.9$, as suggested in [38]. Besides, following previous studies [45, 44], for the text modality, we pad each sequence with `<eos>` (end-of-sequence) and `<pad>` tokens to the maximum length during training, and compute the loss over model's answer tokens, including these special tokens. After the sampling process, we only keep the model responses ahead of the first `<eos>` token. The sampling iterations are set as 32 by default, and the resolution of generated images by FUDOKI is 384 × 384. |
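
The setup quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes the schedule has the form beta_t = c * (t / (1 - t)) ** a with c = 3 and a = 0.9, and the uniform time grid over the 32 sampling iterations is a hypothetical discretization chosen only for illustration. The function names `beta`, `time_grid`, and `truncate_at_eos` are likewise made up for this example.

```python
from typing import List

# Values quoted in the Experiment Setup row.
C = 3.0          # scale c
A = 0.9          # exponent a
NUM_STEPS = 32   # default number of sampling iterations


def beta(t: float, c: float = C, a: float = A) -> float:
    """Assumed schedule beta_t = c * (t / (1 - t)) ** a, defined for 0 <= t < 1."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return c * (t / (1.0 - t)) ** a


def time_grid(num_steps: int = NUM_STEPS) -> List[float]:
    """Hypothetical uniform grid t_k = k / num_steps for k = 0 .. num_steps - 1."""
    return [k / num_steps for k in range(num_steps)]


def truncate_at_eos(tokens: List[str], eos: str = "<eos>") -> List[str]:
    """Keep only the tokens that come before the first end-of-sequence token."""
    if eos in tokens:
        return tokens[: tokens.index(eos)]
    return list(tokens)


if __name__ == "__main__":
    # Print the assumed schedule value at each grid point.
    for t in time_grid():
        print(f"t={t:.4f}  beta_t={beta(t):.4f}")
    # Padded response: everything from the first <eos> onward is dropped.
    print(truncate_at_eos(["A", "cat", "<eos>", "<pad>", "<pad>"]))
```

Under this assumed form, the schedule starts at zero and grows without bound as t approaches 1, which is why the grid above stops one step short of t = 1.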