Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Lin, Luke Zettlemoyer, Lili Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate the effectiveness of our approach, we conduct comprehensive experiments comparing LMFusion with Transfusion in controlled settings. Specifically, we initialize our LMFusion architecture with a pretrained Llama-3 8B model [8] and continue training on the same image data as in Transfusion [4]. Compared to Transfusion, LMFusion achieves a 20% improvement in image understanding and 3.6% improvement in image generation. It also preserves Llama-3's text-only performance that outperforms Transfusion by 11.6%. Figure 2 presents images generated by LMFusion.
Researcher Affiliation	Collaboration	Weijia Shi (w), Xiaochuang Han (w), Chunting Zhou (f), Weixin Liang (s), Xi Victoria Lin (f), Luke Zettlemoyer (w, f), Lili Yu (f); w: University of Washington, f: FAIR at Meta, s: Stanford University
Pseudocode	No	The paper describes the model architecture and training process using descriptive text and mathematical equations (5-14) in Section 3.1, but it does not include a distinct section labeled 'Pseudocode' or 'Algorithm', nor are there structured code-like blocks presented.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Data and code will be open-sourced with instructions
Open Datasets	Yes	Data: Following Transfusion [4], we use the same collection of 380M Shutterstock image-caption data, where each image is center-cropped and resized to 256 × 256 pixels.
Dataset Splits	Yes	We order the captions before images (i.e., emphasizing image generation conditioned on texts) 80% of the time, and order the images before captions for the rest.
Hardware Specification	Yes	The model is trained using 128 H100 GPUs over 4 days.
Software Dependencies	No	The paper mentions specific models like 'Llama-3 8B model' and optimizers like 'AdamW optimizer' but does not provide specific version numbers for any software libraries or frameworks used (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	Optimization: In our main experiments, to preserve the language-only performance, we freeze the text modules (η_text = 0) while training only the image modules using an AdamW optimizer (β1 = 0.9, β2 = 0.95, ϵ = 1 × 10⁻⁸) with a learning rate η_image = 1 × 10⁻⁴. The learning rate follows a cosine decay schedule with a 4000-step warmup period before gradually decreasing to 1.5 × 10⁻⁵.
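
The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the total number of training steps is not given in this table, so TOTAL_STEPS is a placeholder, and the warmup is assumed to be linear from zero because the quoted text only says "warmup". The function names image_lr and text_lr are hypothetical.

```python
import math

PEAK_LR = 1e-4         # η_image, from the quoted setup
FINAL_LR = 1.5e-5      # value the cosine decay ends at, from the quoted setup
WARMUP_STEPS = 4000    # warmup length, from the quoted setup
TOTAL_STEPS = 250_000  # ASSUMPTION: placeholder, not stated in this table


def image_lr(step, total_steps=TOTAL_STEPS):
    """Learning rate for the image modules at a given optimizer step."""
    if step < WARMUP_STEPS:
        # ASSUMPTION: linear warmup starting from 0.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def text_lr(step):
    """Text modules are frozen in the main experiments (η_text = 0)."""
    return 0.0


if __name__ == "__main__":
    for s in (0, 2000, 4000, 100_000, TOTAL_STEPS):
        print(s, image_lr(s), text_lr(s))
```

With these definitions the image-module rate rises during the first 4000 steps, peaks at 1 × 10⁻⁴, and then decays toward 1.5 × 10⁻⁵, while the text-module rate stays at zero throughout. In a framework such as PyTorch, the same effect would typically be obtained by placing text and image parameters in separate optimizer parameter groups.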