Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Show-o2: Improved Native Unified Multimodal Models
Authors: Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results have demonstrated that our model surpasses the existing methods in terms of most metrics across multimodal understanding and visual generation benchmarks. Collectively, the main contributions of this paper can be summarized as: The proposed model demonstrates state-of-the-art performance on multimodal understanding and visual generation benchmarks, surpassing existing methods across various metrics. |
| Researcher Affiliation | Collaboration | Jinheng Xie¹, Zhenheng Yang², Mike Zheng Shou¹. ¹Show Lab, National University of Singapore; ²ByteDance |
| Pseudocode | No | The paper describes its methodology using text, equations, and figures, but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code and models are released at https://github.com/showlab/Show-o. We release the training and inference code at https://github.com/showlab/Show-o and most of the training data is publicly available. |
| Open Datasets | Yes | The curated approximately 66M image-text pairs consist of images with a resolution of at least 512 pixels in width and height. The images are filtered from CC12M [14], COYO [12], LAION-Aesthetic-12M and AI synthetic data. The 9M high-quality multimodal understanding instruction data is curated from Densefusion-1M [63], and LLaVA-OneVision [56]. |
| Dataset Splits | No | The paper describes the quantity of data used for training in different stages (e.g., "66M image-text pairs", "9M high-quality multimodal understanding instruction data", "16M high-quality visual generation data", "1.6M video understanding data") and mentions using standard benchmarks for evaluation. However, it does not explicitly provide specific train/test/validation splits for its own curated datasets or details on how data was partitioned for internal evaluations beyond referencing existing benchmarks' evaluation settings. |
| Hardware Specification | Yes | This training process roughly takes one and a half days using 64 H100 GPUs. The whole training process of our 7B model takes approximately 2 and a half days using 128 H100 GPUs. |
| Software Dependencies | No | The paper mentions models like SigLIP-so400m-patch14-384 and Qwen2.5-1.5B-Instruct/Qwen2.5-7B-Instruct, but does not provide specific version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | The semantic layers S(·) are pre-distilled from SigLIP-so400m-patch14-384 over 200K iterations, using a batch size of 512 and a cosine-scheduled learning rate of 2e-5. In the first stage, we train these components using autoregressive modeling and flow matching using around 66M image-text pairs. The context length of single image-text pairs is set as 1024. The total batch sizes for multimodal understanding and generation are 128 and 384, respectively. α in Eq. 4 is set as 0.2. For visual generation data, the caption is dropped with a probability of 0.1 to enable the classifier-free guidance. We follow the training strategies in LLaVA-OneVision [56] to train the 1.5B model using around 9M multimodal instructional and 16M high-quality generation data for a total of around 35K iterations. α in Eq. 4 is set as 1.0. |
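
Two of the settings quoted in the Experiment Setup row (a cosine-scheduled learning rate of 2e-5 over 200K iterations, and dropping captions with probability 0.1 for classifier-free guidance) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, and the absence of warmup, the decay to a minimum rate of 0, and the use of an empty string as the null caption are assumptions the quoted text does not confirm.

```python
import math
import random


def cosine_lr(step: int, total_steps: int = 200_000,
              peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate.

    peak_lr and total_steps come from the quoted setup; no warmup and
    min_lr=0.0 are assumptions made for this sketch.
    """
    # Clamp the step so values outside [0, total_steps] stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def maybe_drop_caption(caption: str, rng: random.Random,
                       drop_prob: float = 0.1, null_caption: str = "") -> str:
    """Replace the caption with a null prompt with probability drop_prob.

    drop_prob=0.1 is from the quoted setup; the empty-string null caption
    is an assumption made for this sketch.
    """
    return null_caption if rng.random() < drop_prob else caption


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50_000, 100_000, 200_000):
        print(step, cosine_lr(step))
    print(repr(maybe_drop_caption("a photo of a cat", rng)))
```

Training with a fraction of captions dropped lets a single model learn both conditional and unconditional predictions, which classifier-free guidance then combines at sampling time.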