Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
Authors: Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GENEVAL and 85.19 on DPG-BENCH. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to future research. |
| Researcher Affiliation | Collaboration | Rui Tian (1, 2), Mingfei Gao (2), Mingze Xu (2), Jiaming Hu (2), Jiasen Lu (2), Zuxuan Wu (1), Yinfei Yang (2), Afshin Dehghan (2). (1) Institute of Trustworthy Embodied AI, Fudan University; (2) Apple |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (e.g., Figure 2, Figure 3, Figure 4) but does not include explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code is available at https://github.com/apple/ml-unigen. |
| Open Datasets | Yes | Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks... Unlike state-of-the-art models [9, 47, 87, 84] that rely on large-scale internal datasets, we curate new data mixtures across training stages by using only open-source images. We show that models trained on publicly available data can also achieve competitive results. ... Pre-training Data. We generate fine-grained captions for images from ImageNet [62], CC-3M [63], CC-12M [6] and SAM-11M [31] dataset using Qwen2.5-VL-7B [3] to form a 40M image-text pair corpus. For text-only pre-training, we use RefinedWeb [55]. ... For image understanding, we adopt the strong image mixture from SlowFast-LLaVA-1.5 [90], which was carefully curated from open-source datasets with 4.67M multimodal VQA samples. For image generation, prior work [9] uses high-quality synthetic data that can enable fast and robust training convergence. We share this observation by using the JourneyDB [64] and text-2-image-2M [28] to improve the aesthetic quality of our generated images. We name the model trained in this stage as UniGen-SFT. |
| Dataset Splits | No | The paper references various datasets and benchmarks, some of which imply standard splits are used (e.g., 'T2I-Comp [25] training set' or evaluation on GENEVAL and DPG-BENCH). However, it does not explicitly state specific training/validation/test splits (e.g., percentages, sample counts, or explicit standard split names) for the primary model training or evaluation datasets within the provided text. |
| Hardware Specification | Yes | We use 32 H100-80G GPUs for pre-training stages and 8 H100-80G GPUs for the others. |
| Software Dependencies | No | The paper mentions several models and toolkits used, such as 'UniGen is built upon the pre-trained Qwen2.5-1.5B [91]', 'MAGVITv2 from Show-o [87]', 'SigLIP [100]', 'Qwen2.5-7B [91]', 'Qwen2.5-VL-7B', 'lmms-eval toolkit', and 'official evaluation repository of GENEVAL and DPG-BENCH'. While these components are named, specific version numbers for general software dependencies like Python, PyTorch, or CUDA, which are crucial for full reproducibility, are not provided. |
| Experiment Setup | Yes | Detailed hyperparameters for each training stage are described in Appendix Table 17 with more details in Appendix Sec. E.0.2. Table 17: Hyperparameter setup for different training stages of UniGen. Data ratio refers to the ratio of image understanding data, pure text data, and image generation data. Hyperparameters: Learning rate, LR scheduler, Weight decay, Gradient clip, Optimizer, Warm-up steps, Training steps, H100 hours, Batch size, Data ratio. |
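
The Research Type row above mentions a Best-of-N test-time strategy in which the model verifies its own generations step by step. The following Python sketch illustrates only the general Best-of-N selection pattern and is not taken from the paper or its repository. The names `best_of_n`, `cot_v_score`, `generate`, and `verify` are hypothetical, and the scoring rule (the fraction of step-by-step checks that pass) is an assumption made for illustration.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def cot_v_score(answers: Sequence[bool]) -> float:
    """Assumed scoring rule: fraction of step-by-step checks answered 'yes'."""
    if not answers:
        return 0.0
    return sum(answers) / len(answers)


def best_of_n(
    prompt: str,
    generate: Callable[[str], Image],               # hypothetical image generator
    verify: Callable[[str, Image], Sequence[bool]],  # hypothetical step-wise verifier
    n: int = 4,
) -> Tuple[Image, float]:
    """Generate n candidates and return the one with the highest verifier score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [cot_v_score(verify(prompt, c)) for c in candidates]
    # max() keeps the first candidate when several share the top score.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In this sketch the generator and verifier are passed in as separate callables, so the same model could fill both roles, as the abstract describes, or they could be stubbed out for testing.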