Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Lin, Luke Zettlemoyer, Lili Yu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of our approach, we conduct comprehensive experiments comparing LMFusion with Transfusion in controlled settings. Specifically, we initialize our LMFusion architecture with a pretrained Llama-3 8B model [8] and continue training on the same image data as in Transfusion [4]. Compared to Transfusion, LMFusion achieves a 20% improvement in image understanding and 3.6% improvement in image generation. It also preserves Llama-3's text-only performance that outperforms Transfusion by 11.6%. Figure 2 presents images generated by LMFusion. |
| Researcher Affiliation | Collaboration | Weijia Shi (w), Xiaochuang Han (w), Chunting Zhou (f), Weixin Liang (s), Xi Victoria Lin (f), Luke Zettlemoyer (w, f), Lili Yu (f). Affiliation key: w = University of Washington; f = FAIR at Meta; s = Stanford University. |
| Pseudocode | No | The paper describes the model architecture and training process using descriptive text and mathematical equations (5-14) in Section 3.1, but it does not include a distinct section labeled 'Pseudocode' or 'Algorithm', nor are there structured code-like blocks presented. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Data and code will be open-sourced with instructions |
| Open Datasets | Yes | Data. Following Transfusion [4], we use the same collection of 380M Shutterstock image-caption data, where each image is center-cropped and resized to 256 × 256 pixels. |
| Dataset Splits | Yes | We order the captions before images (i.e., emphasizing image generation conditioned on texts) 80% of the time, and order the images before captions for the rest. |
| Hardware Specification | Yes | The model is trained using 128 H100 GPUs over 4 days. |
| Software Dependencies | No | The paper mentions specific models like 'Llama-3 8B model' and optimizers like 'AdamW optimizer' but does not provide specific version numbers for any software libraries or frameworks used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Optimization. In our main experiments, to preserve the language-only performance, we freeze the text modules (η_text = 0) while training only the image modules using an AdamW optimizer (β1 = 0.9, β2 = 0.95, ϵ = 1 × 10⁻⁸) with a learning rate η_image = 1 × 10⁻⁴. The learning rate follows a cosine decay schedule with a 4000-step warmup period before gradually decreasing to 1.5 × 10⁻⁵. |
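
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: the peak rate, final rate, and warmup length come from the quoted text, while the linear warmup shape, the `TOTAL_STEPS` value, and the function names `image_lr` and `text_lr` are assumptions made for illustration only.

```python
import math

PEAK_LR = 1e-4         # eta_image, from the quoted setup
FINAL_LR = 1.5e-5      # value the schedule decays to, from the quoted setup
WARMUP_STEPS = 4000    # warmup length, from the quoted setup
TOTAL_STEPS = 100_000  # hypothetical: the total step count is not given here


def image_lr(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate for the image modules at a 0-indexed training step."""
    if step < WARMUP_STEPS:
        # Assumption: warmup is linear; the quoted text only gives its length.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / decay_steps)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def text_lr(step: int) -> float:
    """Text modules are frozen in the main experiments (eta_text = 0)."""
    return 0.0
```

Under these assumptions, `image_lr` rises to the peak rate by the end of warmup and then falls smoothly to the final rate at `total_steps`, while `text_lr` stays at zero throughout.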