Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Authors: Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Quantitatively, Show-o demonstrates comparable or even better performance than individual models with an equivalent or larger number of parameters across benchmarks. In contrast to autoregressively generating an image, Show-o requires approximately 20 times fewer sampling steps, exhibiting inherent potential in acceleration. Besides, as shown in Fig. 2, Show-o naturally supports various downstream applications like text-guided inpainting and extrapolation, without any fine-tuning. Moreover, we have demonstrated that Show-o has the potential for mixed-modality generation like interleaved video keyframe generation with text descriptions, video understanding, and video generation. This demonstrates the potential of the unified model as a feasible paradigm for long-form video understanding and generation. Beyond this, we investigate the impact of dataset scale, image resolution, and different types of image representations (discrete or continuous) on multimodal understanding performance, presenting systematic insights for the design of a unified model in the future. |
| Researcher Affiliation | Collaboration | Jinheng Xie¹, Weijia Mao¹, Zechen Bai¹, David Junhao Zhang¹, Weihao Wang², Kevin Qinghong Lin¹, Yuchao Gu¹, Zhijie Chen², Zhenheng Yang², Mike Zheng Shou¹. ¹ Show Lab, National University of Singapore; ² ByteDance |
| Pseudocode | No | The paper describes its methodology in Section 3 and details components like Tokenization, Architecture, Unified Prompting, Omni-Attention Mechanism, and Training Objectives with mathematical formulations. However, it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with step-by-step instructions for a procedure. |
| Open Source Code | Yes | Code and models are released at https://github.com/showlab/Show-o. |
| Open Datasets | Yes | We assemble two scales of datasets, i.e., around 35M and 2.0B image-text pairs, and collect around 2M high-quality data for multimodal understanding and generation fine-tuning. Besides, RefinedWeb (Penedo et al., 2023) is adopted as text corpora to maintain the language modeling capability. Appendix E provides more details about these datasets. [...] We employ the publicly available RefinedWeb dataset (Penedo et al., 2023)... ImageNet-1K dataset (Deng et al., 2009)... publicly available datasets including CC12M (Changpinyo et al., 2021), SA-1B (Kirillov et al., 2023), and LAION-aesthetics-12M... DataComp (Gadre et al., 2024) and COYO-700M (Byeon et al., 2022)... ShareGPT4V (Chen et al., 2023)... LLaVA-v1.5 (Liu et al., 2024b), we incorporate LLaVA-Pretrain558K and LLaVA-v1.5-mix-665K... The GenHowTo dataset (Souček et al., 2024) is utilized for mixed-modality generation. |
| Dataset Splits | Yes | Following LLaVA (Liu et al., 2024b), we evaluate the multimodal understanding capabilities of Show-o on the POPE, MME, Flickr30k, VQAv2, GQA, and MMMU benchmarks. Besides, we adopt the Fréchet Inception Distance (FID) on the MSCOCO dataset to evaluate the generation fidelity of Show-o. Further, we follow SD3 (Esser et al., 2024) to evaluate the text-to-image generation capabilities of Show-o on the GenEval (Ghosh et al., 2023) benchmark. |
| Hardware Specification | Yes | The base model is trained on 48 A100 (80GB) GPUs with a total batch size of 1,152. |
| Software Dependencies | No | The paper mentions the use of the 'AdamW optimizer' but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) along with their version numbers. |
| Experiment Setup | Yes | The base model is trained on 48 A100 (80GB) GPUs with a total batch size of 1,152. We employ the AdamW optimizer with a weight decay of 0.01, 5,000 steps of warm-up, and an initial learning rate of 1e-4 with cosine scheduling. Finally, we fine-tune Show-o with around 1M internal high-quality image-text pairs and adhere to the configuration of LLaVA-v1.5 for instruction data tuning. |
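The quoted Experiment Setup describes the learning-rate schedule only in prose (5,000 warm-up steps, peak 1e-4, cosine decay). A minimal sketch of that schedule in plain Python is below; note that `total_steps` is an assumption for illustration, since the paper excerpt quoted here does not state the total number of training steps.

```python
import math

def lr_at_step(step, peak_lr=1e-4, warmup_steps=5_000, total_steps=500_000):
    """Linear warm-up to peak_lr, then cosine decay toward zero.

    Matches the schedule described in the quoted setup; total_steps
    is a placeholder, not a value from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# The schedule reaches its peak exactly at the end of warm-up.
print(lr_at_step(5_000))  # -> 0.0001
```

In a real run this function would typically be wrapped in a framework scheduler (e.g., a per-step `LambdaLR` in PyTorch), but the standalone form keeps the arithmetic of the reported hyperparameters visible.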