Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
Researcher Affiliation	Collaboration	Ming Nie¹, Chunwei Wang², Jianhua Han², Hang Xu², Li Zhang¹; ¹School of Data Science, Fudan University; ²Noah's Ark Lab, Huawei
Pseudocode	No	The paper describes its methodology in Section 3, detailing a warm-up training scheme and a GRPO-based optimization phase, including equations and descriptive text, but it does not present these steps in a structured pseudocode or algorithm block.
Open Source Code	Yes	https://github.com/LogosRoboticsGroup/UnifiedGRPO
Open Datasets	Yes	During the warm-up stage, we collect 0.3M interleaved text-image samples from ActivityNet [1], GenHowTo [23], and OpenStory++ [37]. To preserve the model's original multimodal understanding and generation abilities, we further incorporate 1M multimodal understanding samples from EMOVA [2] and 1M text-to-image generation samples from JourneyDB [25]. We evaluate the model's capability for interleaved multimodal generation on two dedicated benchmarks: MMIE [33] and InterleavedBench [14].
Dataset Splits	No	During the warm-up stage, we collect 0.3M interleaved text-image samples from ActivityNet [1], GenHowTo [23], and OpenStory++ [37]... For the GRPO stage, we curate a dataset of 0.1M samples (from the same sources as in the warm-up stage) focusing on visual storytelling and multimodal interleaved reasoning to facilitate effective policy optimization. The paper then states that evaluation is performed on MMIE and InterleavedBench, which are external benchmarks.
Hardware Specification	Yes	All full-scale experiments are conducted using 32 NVIDIA A100 GPUs.
Software Dependencies	No	We implement VILA-U [32] as our foundation unified model. Thanks to its unified pretraining paradigm, which jointly learns multimodal comprehension and text-to-image generation, the model inherently possesses the potential for multimodal output. Our approach builds on this capability and unlocks interleaved generation with only minimal additional data.
Experiment Setup	Yes	All images are resized to a fixed resolution of 256 × 256 before being fed into the model. For image generation, we employ classifier-free guidance with a guidance scale of 3 to improve output fidelity. In GRPO stage, the number of generation G is set to 4 and we train the model for 3k steps.
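
Illustration (not taken from the paper or its repository): the Experiment Setup row quotes a group size of G = 4 and a guidance scale of 3. The minimal Python sketch below shows how such settings are commonly used, assuming the widely used GRPO formulation in which each reward is normalized by the mean and standard deviation of its own group, and one common form of classifier-free guidance. The function names, the use of the population standard deviation, the eps term, and the example reward values are all assumptions made for illustration; the authors' implementation may differ.

```python
from statistics import mean, pstdev

# Values quoted in the Experiment Setup row above.
GROUP_SIZE = 4        # number of generations G per prompt in the GRPO stage
GUIDANCE_SCALE = 3.0  # classifier-free guidance scale


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled generations.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. Implementations vary on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cfg_combine(cond, uncond, scale=GUIDANCE_SCALE):
    """One common form of classifier-free guidance.

    Moves the unconditional prediction toward the conditional one:
    guided = uncond + scale * (cond - uncond).
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Hypothetical rewards for the G = 4 generations of a single prompt.
    rewards = [0.2, 0.5, 0.9, 0.4]
    assert len(rewards) == GROUP_SIZE
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Because the advantages are centered within each group, generations that score above the group mean receive a positive learning signal and those below receive a negative one, without requiring a separate value model.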