Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Authors: Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Quantitatively, Show-o demonstrates comparable or even better performance than individual models with an equivalent or larger number of parameters across benchmarks. In contrast to autoregressively generating an image, Show-o requires approximately 20 times fewer sampling steps, exhibiting inherent potential in acceleration. Besides, as shown in Fig. 2, Show-o naturally supports various downstream applications like text-guided inpainting and extrapolation, without any fine-tuning. Moreover, we have demonstrated that Show-o has the potential for mixed-modality generation like interleaved video keyframe generation with text descriptions, video understanding, and video generation. This demonstrates the potential of the unified model as a feasible paradigm for long-form video understanding and generation. Beyond this, we investigate the impact of dataset scale, image resolution, and different types of image representations (discrete or continuous) on multimodal understanding performance, presenting systematic insights for the design of a unified model in the future. |
| Researcher Affiliation | Collaboration | Jinheng Xie¹, Weijia Mao¹, Zechen Bai¹, David Junhao Zhang¹, Weihao Wang², Kevin Qinghong Lin¹, Yuchao Gu¹, Zhijie Chen², Zhenheng Yang², Mike Zheng Shou¹. ¹ Show Lab, National University of Singapore; ² ByteDance |
| Pseudocode | No | The paper describes its methodology in Section 3 and details components like Tokenization, Architecture, Unified Prompting, Omni-Attention Mechanism, and Training Objectives with mathematical formulations. However, it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with step-by-step instructions for a procedure. |
| Open Source Code | Yes | Code and models are released at https://github.com/showlab/Show-o. |
| Open Datasets | Yes | We assemble two scales of datasets, i.e., around 35M and 2.0B image-text pairs, and collect around 2M high-quality data for multimodal understanding and generation fine-tuning. Besides, RefinedWeb (Penedo et al., 2023) is adopted as text corpora to maintain the language modeling capability. Appendix E provides more details about these datasets. [...] We employ the publicly available RefinedWeb dataset (Penedo et al., 2023)... ImageNet-1K dataset (Deng et al., 2009)... publicly available datasets including CC12M (Changpinyo et al., 2021), SA-1B (Kirillov et al., 2023), and LAION-aesthetics-12M... DataComp (Gadre et al., 2024) and COYO-700M (Byeon et al., 2022)... ShareGPT4V (Chen et al., 2023)... LLaVA-v1.5 (Liu et al., 2024b), we incorporate LLaVA-Pretrain558K and LLaVA-v1.5-mix-665K... The GenHowTo dataset (Souček et al., 2024) is utilized for mixed-modality generation. |
| Dataset Splits | Yes | Following LLaVA (Liu et al., 2024b), we evaluate the multimodal understanding capabilities of Show-o on the POPE, MME, Flickr30k, VQAv2, GQA, and MMMU benchmarks. Besides, we adopt the Fréchet Inception Distance (FID) on the MSCOCO dataset to evaluate the generation fidelity of Show-o. Further, we follow SD3 (Esser et al., 2024) to evaluate the text-to-image generation capabilities of Show-o on the GenEval (Ghosh et al., 2023) benchmark. |
| Hardware Specification | Yes | The base model is trained on 48 A100 (80GB) GPUs with a total batch size of 1,152. |
| Software Dependencies | No | The paper mentions the use of the 'AdamW optimizer' but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) along with their version numbers. |
| Experiment Setup | Yes | The base model is trained on 48 A100 (80GB) GPUs with a total batch size of 1,152. We employ the AdamW optimizer with a weight decay of 0.01, 5,000 steps of warm-up, and an initial learning rate of 1e-4 with cosine scheduling. Finally, we fine-tune Show-o with around 1M internal high-quality image-text pairs and adhere to the configuration of LLaVA-v1.5 for instruction data tuning. |
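The quoted Experiment Setup describes the learning-rate schedule only in prose (5,000 warm-up steps, peak 1e-4, cosine decay). A minimal sketch of that schedule in plain Python is below; note that `total_steps` is an assumption for illustration, since the paper excerpt quoted here does not state the total number of training steps.

```python
import math

def lr_at_step(step, peak_lr=1e-4, warmup_steps=5_000, total_steps=500_000):
    """Linear warm-up to peak_lr, then cosine decay toward zero.

    Matches the schedule described in the quoted setup; total_steps
    is a placeholder, not a value from the paper.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# The schedule reaches its peak exactly at the end of warm-up.
print(lr_at_step(5_000))  # -> 0.0001
```

In a real run this function would typically be wrapped in a framework scheduler (e.g., a per-step `LambdaLR` in PyTorch), but the standalone form keeps the arithmetic of the reported hyperparameters visible.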