Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Authors: Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that BIFROST-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices. Project page: https://bifrost-1.github.io. |
| Researcher Affiliation | Collaboration | Han Lin¹, Jaemin Cho¹, Amir Zadeh², Chuan Li², Mohit Bansal¹; ¹UNC Chapel Hill, ²Lambda |
| Pseudocode | No | The paper describes the model architecture and processes using mathematical equations and textual descriptions in Section A 'Formal Description of the BIFROST-1 MLLM Architecture', but does not present structured pseudocode or an algorithm block. |
| Open Source Code | Yes | We include our code in the supplementary material. Project page: https://bifrost-1.github.io. |
| Open Datasets | Yes | We train BIFROST-1 on ImageNet [16] for the experiments in Sec. 5.1 and Sec. 5.2... Our models with Qwen2.5-VL 3B/7B are trained on 9M and 62M images respectively from the BLIP3-o [8] training dataset... For open prompt evaluation, we follow previous works [10, 51, 63, 59] and report FID scores on MJHQ-30K [35] and 30k randomly sampled images from MSCOCO [38] validation set for visual aesthetic quality, and GenEval [25] and DPG-Bench [28] for prompt alignment, respectively. |
| Dataset Splits | Yes | We train BIFROST-1 on ImageNet [16] for the experiments in Sec. 5.1 and Sec. 5.2... For ImageNet, following previous works [36, 56], we evaluate our model on Fréchet Inception Distance (FID) [27], sFID [45], and Inception Score (IS) [57]. For open prompt evaluation, we follow previous works [10, 51, 63, 59] and report FID scores on MJHQ-30K [35] and 30k randomly sampled images from MSCOCO [38] validation set for visual aesthetic quality, and GenEval [25] and DPG-Bench [28] for prompt alignment, respectively. |
| Hardware Specification | Yes | All experiments on ImageNet (Sec. 5.1 and Sec. 5.2) are trained on a single GH200 GPU, and the SoTA comparison experiments in Sec. 5.3 are trained on 16 GB200 GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, Hugging Face Transformers, and Hugging Face Diffusers, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Specifically, we train a ControlNet consisting of 4 MM-DiT [19] (Double Stream) blocks and 1 Single DiT [52] (Single Stream) block (i.e., num_double_layers=4, num_single_layers=1) with a total batch size of 48. All other training hyperparameters, including learning rate, Adam optimizer, and weight decay, are kept identical to the original codebase without any tuning. During latent ControlNet inference, we also retain all default hyperparameters unchanged (e.g., num_inference_steps=28, controlnet_conditioning_scale=0.7, guidance_scale=3.5). For the MLLM visual prediction branch, we use the mean squared error (MSE) loss for image patch embedding prediction. For the latent ControlNet, we use the original flow-matching loss used in FLUX ControlNet. Specifically, the latent ControlNet and BIFROST-1 MLLM in Sec. 5.1 are trained for 2 epochs and 16 epochs respectively, and the latent ControlNet in Sec. 5.2 is trained for only 1 epoch (25M training steps). |
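
The hyperparameters quoted in the Experiment Setup row can be collected in one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class names `ControlNetTrainConfig` and `ControlNetInferenceConfig` and the helper `mse_loss` are hypothetical, introduced here only for illustration. Only the numeric values come from the quoted text. Learning rate, optimizer settings, and weight decay are left out on purpose, because the paper states only that they match the original codebase defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlNetTrainConfig:
    """Hypothetical container for the training values quoted in the table."""
    num_double_layers: int = 4   # MM-DiT (Double Stream) blocks
    num_single_layers: int = 1   # Single DiT (Single Stream) blocks
    total_batch_size: int = 48
    # Learning rate, optimizer, and weight decay are not quoted in the
    # table; the paper says they follow the original codebase defaults.


@dataclass(frozen=True)
class ControlNetInferenceConfig:
    """Hypothetical container for the quoted inference defaults."""
    num_inference_steps: int = 28
    controlnet_conditioning_scale: float = 0.7
    guidance_scale: float = 3.5


def mse_loss(pred, target):
    """Mean squared error over a list of patch embeddings.

    Assumption: each argument is a list of equal-length float lists, one
    inner list per image patch. A real implementation would use tensors.
    """
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


train_cfg = ControlNetTrainConfig()
infer_cfg = ControlNetInferenceConfig()
```

Freezing the dataclasses keeps the quoted values from being changed by accident during a run, which makes it easier to confirm that a reproduction used the reported settings.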