MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
Authors: Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew M. Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language. (An illustrative sketch of this fused conditioning pipeline follows the table.) |
| Researcher Affiliation | Collaboration | (1) Aleph Alpha, (2) German Research Center for Artificial Intelligence (DFKI), (3) Computer Science Department, TU Darmstadt, (4) Stability AI, (5) University of Texas, (6) Hessian.AI, (7) Centre for Cognitive Science, TU Darmstadt, (8) LAION |
| Pseudocode | No | The paper describes the architecture and processes but does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: 'Therefore, we will not make the model weights publicly available in their current form.' While it makes the MCC-250 benchmark (a dataset) available, it does not provide the source code for the described methodology/model. |
| Open Datasets | Yes | As training data, we use LAION aesthetics V.2 5+, i.e., the subset of LAION 5B with English captions filtered by a predicted aesthetic score > 5 [42]. ... We used a custom version of SNLI [7] and MNLI [47] that extends the original English texts by their machine-translated German versions, which were generated using the DeepL API. |
| Dataset Splits | Yes | We investigate image fidelity using FID-30k scores on the MS COCO validation set [25]. We report the results using textual, multimodal, and image prompts and a comparison against SD v1.5 in Tab. 1. (A sketch of such an FID computation follows the table.) |
| Hardware Specification | No | The paper mentions 'GPU hours' in Table 3 but does not specify any particular GPU models (e.g., NVIDIA A100, RTX 3090) or CPU specifications used for experiments. 'GPU hours' is too generic. |
| Software Dependencies | No | The paper mentions various models and frameworks like Stable Diffusion, CLIP, GPT-3, and MAGMA but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA, specific libraries). |
| Experiment Setup | Yes | The final model is finetuned for 60k steps with the probability of using an image instead of a caption being 0.2. ... We optimize the bias weights of the LM using a contrastive learning objective for 13k steps... we utilize attention manipulation [14] in every attention layer of the transformer encoder. ... We relied on Amazon Mechanical Turk, where we set the following qualification requirements for our users: HIT Approval Rate over 95% and at least 1000 HITs approved. Annotators were fairly compensated according to Amazon MTurk guidelines. Users were paid $0.60 for a batch of 28 images. (A sketch of the image-vs-caption conditioning choice follows the table.) |
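
The fused multimodal conditioning described in the Research Type row is easiest to picture as an interleaved prompt encoder feeding a diffusion model's cross-attention layers. The sketch below is a minimal, self-contained illustration of that idea and is not the authors' implementation: the module names, dimensions, and the small transformer encoder are assumptions made for illustration (the paper itself builds on pretrained components such as Stable Diffusion and MAGMA).

```python
import torch
import torch.nn as nn


class InterleavedPromptEncoder(nn.Module):
    """Toy encoder: embeds an interleaved (text, image) prompt into one
    conditioning sequence for a diffusion model's cross-attention layers."""

    def __init__(self, vocab_size=32000, d_model=1024, n_layers=4, image_dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Hypothetical adapter projecting precomputed image embeddings
        # (e.g. from a CLIP vision tower) into the LM embedding space.
        self.image_proj = nn.Linear(image_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, segments):
        # segments: list of ("text", LongTensor[B, T]) or
        #           ("image", FloatTensor[B, P, image_dim]) in prompt order.
        parts = []
        for kind, value in segments:
            parts.append(self.token_emb(value) if kind == "text"
                         else self.image_proj(value))
        x = torch.cat(parts, dim=1)   # [B, total_len, d_model]
        return self.encoder(x)        # conditioning sequence for the U-Net


if __name__ == "__main__":
    enc = InterleavedPromptEncoder()
    prompt = [
        ("text", torch.randint(0, 32000, (1, 12))),  # e.g. a German caption
        ("image", torch.randn(1, 144, 768)),         # embedded reference image
        ("text", torch.randint(0, 32000, (1, 5))),
    ]
    print(enc(prompt).shape)  # torch.Size([1, 161, 1024])
```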
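
For the FID-30k evaluation mentioned in the Dataset Splits row, the paper does not state which FID implementation was used. The following is a minimal sketch using torchmetrics (an assumption, chosen purely for illustration) of how such a score could be computed from COCO validation images and model samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def compute_fid(real_images: torch.Tensor, fake_images: torch.Tensor,
                feature: int = 2048) -> float:
    """FID between two uint8 image batches of shape [N, 3, H, W]."""
    fid = FrechetInceptionDistance(feature=feature)
    fid.update(real_images, real=True)   # e.g. 30k MS COCO validation images
    fid.update(fake_images, real=False)  # e.g. 30k generated samples
    return fid.compute().item()


if __name__ == "__main__":
    # Tiny random stand-ins; a real FID-30k run would use 30k images per side
    # and the standard 2048-dim Inception features.
    real = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
    fake = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
    print(compute_fid(real, fake, feature=64))
```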
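
The Experiment Setup row quotes a 0.2 probability of conditioning on an image instead of its caption during finetuning. Below is a minimal sketch of that sampling decision only; the `text_encoder` and `image_encoder` callables are hypothetical stand-ins, since the paper's quoted setup does not specify the interface at this level.

```python
import torch

P_IMAGE_CONDITION = 0.2  # probability quoted in the paper


def build_conditioning(captions, images, text_encoder, image_encoder):
    """Per-sample conditioning: the image embedding with probability 0.2,
    otherwise the caption embedding."""
    cond = []
    for caption, image in zip(captions, images):
        if torch.rand(()).item() < P_IMAGE_CONDITION:
            cond.append(image_encoder(image))   # condition on the image itself
        else:
            cond.append(text_encoder(caption))  # condition on the caption
    return torch.stack(cond)


if __name__ == "__main__":
    # Dummy encoders mapping a caption / image to a 1024-d embedding.
    text_encoder = lambda caption: torch.randn(1024)
    image_encoder = lambda image: torch.randn(1024)
    captions = ["a red cat", "ein blauer Vogel"]
    images = [torch.randn(3, 512, 512) for _ in captions]
    print(build_conditioning(captions, images, text_encoder, image_encoder).shape)
```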