Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MMaDA: Multimodal Large Diffusion Language Models

Authors: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model.
Researcher Affiliation | Collaboration | 1 Princeton University, 2 Peking University, 3 Tsinghua University, 4 ByteDance Seed
Pseudocode | Yes | Algorithm 1: UniGRPO Policy Gradient Optimization
Open Source Code | Yes | We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
Open Datasets | Yes | Datasets. To train MMaDA, we utilized a diverse range of datasets tailored for corresponding training stages as follows: (1) Foundational Language and Multimodal Data: For basic text generation capabilities, we adopt the RefinedWeb [43] dataset. For multimodal understanding and generation tasks, we incorporate image-text datasets including ImageNet-1k [44], CC12M [45], SA1B [46], LAION-aesthetics-12M [47], and JourneyDB [48]. (2) Instruction Tuning Data: To enhance instruction-following capabilities, we use Alpaca [49] for textual instructions and LLaVA-1.5 [38] for visual instruction tuning. (3) Reasoning Data: For Mixed Long-CoT finetuning, we curated a diverse set of reasoning datasets. For textual mathematical and logical reasoning, we employed LIMO [50], s1k [51], OpenThoughts [52], and AceMath-Instruct [53]. For multimodal reasoning, we used the LMM-R1 [54] model to generate responses on GeoQA [36] and CLEVR [37], and retained correctly answered instances. Additionally, for world knowledge-aware image generation, we used GPT-4.1 to synthesize factual item-description pairs spanning science, culture, and landmarks, formatted into unified CoT-style traces. (4) Reinforcement Learning Data: For UniGRPO training, we adopt the original mathematical and logical datasets used in Reasoning [55, 36, 37].
Dataset Splits | Yes | Evaluation and Baselines. We evaluate our MMaDA on three distinct tasks using task-specific metrics and baselines: (1) Multimodal Understanding: Following LLaVA [38], we evaluate on POPE, MME, Flickr30k, VQAv2, GQA, and MMMU... (2) Image Generation: We assess generation quality using 50K prompts from our test set to compute CLIP Score [56] and ImageReward [57]... (4) Reinforcement Learning Data: For UniGRPO training, we adopt the original mathematical and logical datasets used in Reasoning [55, 36, 37].
Hardware Specification | Yes | Training is performed on 64 A100 (80GB) GPUs using a global batch size of 1,280.
Software Dependencies | No | The paper mentions models, architectures, and optimizers (e.g., LLaDA tokenizer, MAGVIT-v2 architecture, AdamW optimizer), but does not provide specific version numbers for software dependencies like programming languages or libraries.
Experiment Setup | Yes | Implementation Details. We initialize MMaDA with LLaDA-8B-Instruct's pretrained weights [22] and an image tokenizer with Show-o's pretrained ones. We perform joint training across three stages: Stage1: The initial model is trained for 200K steps using foundational language and multimodal data, including RefinedWeb for text generation, ImageNet-1k for class-conditional image generation, and additional image-text datasets for captioning. This is followed by another 400K steps where ImageNet is replaced with more diverse image-text pairs. Stage2: The model is then jointly trained for 50,000 steps using Instruction Tuning Data and Reasoning Data. Stage3: This final stage consists of UniGRPO training with Reinforcement Learning Data for 50,000 steps. Training is performed on 64 A100 (80GB) GPUs using a global batch size of 1,280. The AdamW optimizer is employed with an initial learning rate of 5e-5 and a cosine learning rate scheduler. ... The denoising process consists of 50 timesteps, and we apply classifier-free guidance with a guidance scale set to 3.5.
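
As a rough aid for readers who want to sanity-check the quoted experiment setup, the following Python sketch collects the stated numbers and derives a few quantities from them. It is not the authors' code. The stage labels, the helper names `per_gpu_batch_size` and `cosine_lr`, the absence of gradient accumulation, and the exact shape of the cosine schedule (no warmup, decay to zero over a caller-supplied horizon) are all assumptions made for illustration; the quoted text only gives the step counts, GPU count, global batch size, initial learning rate, and the phrase "cosine learning rate scheduler".

```python
import math

# Values quoted in the Experiment Setup row; the dictionary keys are
# hypothetical labels, not names used by the paper.
STAGE_STEPS = {
    "stage1_imagenet": 200_000,
    "stage1_diverse_image_text": 400_000,
    "stage2_instruction_and_reasoning": 50_000,
    "stage3_unigrpo": 50_000,
}
NUM_GPUS = 64
GLOBAL_BATCH_SIZE = 1_280
INITIAL_LR = 5e-5


def per_gpu_batch_size(global_batch: int, num_gpus: int) -> int:
    """Samples per GPU per step, ASSUMING no gradient accumulation."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch size must be divisible by GPU count")
    return global_batch // num_gpus


def cosine_lr(step: int, total_steps: int, base_lr: float = INITIAL_LR) -> float:
    """Textbook cosine decay from base_lr to 0 (ASSUMED: no warmup, floor of 0).

    Whether the schedule restarts per stage or spans all stages is not stated
    in the source, so total_steps is left to the caller.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("total optimizer steps:", sum(STAGE_STEPS.values()))
    print("per-GPU batch size:", per_gpu_batch_size(GLOBAL_BATCH_SIZE, NUM_GPUS))
    print("LR halfway through a 50,000-step stage:", cosine_lr(25_000, 50_000))
```

Under these assumptions the four phases sum to 700,000 optimizer steps and each GPU processes 20 samples per step. Both figures would change if the actual run used gradient accumulation or a different schedule horizon.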