Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation. |
| Researcher Affiliation | Collaboration | Ming Nie¹, Chunwei Wang², Jianhua Han², Hang Xu², Li Zhang¹; ¹School of Data Science, Fudan University; ²Noah's Ark Lab, Huawei |
| Pseudocode | No | The paper describes its methodology in Section 3, detailing a warm-up training scheme and a GRPO-based optimization phase, including equations and descriptive text, but it does not present these steps in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | https://github.com/LogosRoboticsGroup/UnifiedGRPO |
| Open Datasets | Yes | During the warm-up stage, we collect 0.3M interleaved text-image samples from ActivityNet [1], GenHowTo [23], and OpenStory++ [37]. To preserve the model's original multimodal understanding and generation abilities, we further incorporate 1M multimodal understanding samples from EMOVA [2] and 1M text-to-image generation samples from JourneyDB [25]. We evaluate the model's capability for interleaved multimodal generation on two dedicated benchmarks: MMIE [33] and InterleavedBench [14]. |
| Dataset Splits | No | During the warm-up stage, we collect 0.3M interleaved text-image samples from ActivityNet [1], GenHowTo [23], and OpenStory++ [37]... For the GRPO stage, we curate a dataset of 0.1M samples (from the same sources as in the warm-up stage) focusing on visual storytelling and multimodal interleaved reasoning to facilitate effective policy optimization. The paper then states that evaluation is performed on MMIE and InterleavedBench, which are external benchmarks. |
| Hardware Specification | Yes | All full-scale experiments are conducted using 32 NVIDIA A100 GPUs. |
| Software Dependencies | No | We implement VILA-U [32] as our foundation unified model. Thanks to its unified pretraining paradigm, which jointly learns multimodal comprehension and text-to-image generation, the model inherently possesses the potential for multimodal output. Our approach builds on this capability and unlocks interleaved generation with only minimal additional data. |
| Experiment Setup | Yes | All images are resized to a fixed resolution of 256 × 256 before being fed into the model. For image generation, we employ classifier-free guidance with a guidance scale of 3 to improve output fidelity. In GRPO stage, the number of generation G is set to 4 and we train the model for 3k steps. |
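
To make the quoted experiment setup concrete, the sketch below shows how a group size of G = 4 and a guidance scale of 3 typically enter a GRPO-style pipeline. It is an illustration under stated assumptions, not the authors' implementation: the function names, the example rewards, the use of population standard deviation, and the `eps` stabilizer are all hypothetical, and the guidance formula is the common textbook form rather than one confirmed by the paper.

```python
import statistics

# Settings quoted in the table above; everything else here is illustrative.
GROUP_SIZE = 4        # G, the number of generations sampled per prompt
GUIDANCE_SCALE = 3.0  # classifier-free guidance scale for image generation


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled generations.

    Assumption: population standard deviation and a small eps are used
    to keep the division stable; real implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def apply_cfg(cond_logits, uncond_logits, scale=GUIDANCE_SCALE):
    """Common CFG form: move from unconditional toward conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical rewards for the G = 4 generations of a single prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
assert len(advantages) == GROUP_SIZE
```

Generations that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.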