MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Authors: Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew M. Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
Researcher Affiliation | Collaboration | (1) Aleph Alpha, (2) German Research Center for Artificial Intelligence (DFKI), (3) Computer Science Department, TU Darmstadt, (4) Stability AI, (5) University of Texas, (6) Hessian.AI, (7) Centre for Cognitive Science, TU Darmstadt, (8) LAION
Pseudocode | No | The paper describes the architecture and processes but does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states: 'Therefore, we will not make the model weights publicly available in their current form.' While it makes the MCC-250 benchmark (a dataset) available, it does not provide the source code for the described methodology or model.
Open Datasets | Yes | As training data, we use LAION Aesthetics V2 5+ [8], i.e., the subset of LAION-5B with English captions filtered by a predicted aesthetic score > 5 [42]. ... We used a custom version of SNLI [7] and MNLI [47] that extends the original English texts by their machine-translated German versions, which were generated using the DeepL API. (A hedged translation sketch follows the table.)
Dataset Splits | Yes | We investigate image fidelity using FID-30k scores on the MS COCO validation set [25]. We report the results using textual, multimodal, and image prompts and a comparison against SD v1.5 in Tab. 1. (An FID evaluation sketch follows the table.)
Hardware Specification | No | The paper mentions 'GPU hours' in Table 3 but does not specify particular GPU models (e.g., NVIDIA A100, RTX 3090) or CPU specifications used for the experiments; 'GPU hours' alone is too generic.
Software Dependencies | No | The paper mentions models and frameworks such as Stable Diffusion, CLIP, GPT-3, and MAGMA but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA, or other libraries).
Experiment Setup | Yes | The final model is finetuned for 60k steps with the probability of using an image instead of a caption being 0.2. ... We optimize the bias weights of the LM using a contrastive learning objective for 13k steps... we utilize attention manipulation [14] in every attention layer of the transformer encoder. ... We relied on Amazon Mechanical Turk, where we set the following qualification requirements for our users: HIT Approval Rate over 95% and at least 1000 HITs approved. Annotators were fairly compensated according to Amazon MTurk guidelines. Users were paid $0.60 for a batch of 28 images. (A data-mixing sketch follows the table.)
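
The 'Open Datasets' row quotes the machine translation of SNLI/MNLI texts via the DeepL API. The following is a minimal sketch of such a translation step, assuming the official `deepl` Python client; the example field names (`premise`, `hypothesis`) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of machine-translating SNLI/MNLI texts with the DeepL API.
# Not the authors' code; field names and batching are assumptions.
import deepl

def translate_pairs(examples, auth_key):
    """Translate English premise/hypothesis pairs into German."""
    translator = deepl.Translator(auth_key)
    german = []
    for ex in examples:  # e.g. dicts with "premise" and "hypothesis" fields
        german.append({
            "premise_de": translator.translate_text(ex["premise"], target_lang="DE").text,
            "hypothesis_de": translator.translate_text(ex["hypothesis"], target_lang="DE").text,
        })
    return german
```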
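
For the 'Dataset Splits' row, an FID-30k evaluation on the MS COCO validation set could be reproduced along the following lines. This is a sketch using `torchmetrics` (not mentioned in the paper) and assumes batches of reference and generated images are already available as uint8 tensors.

```python
# Sketch of an FID-30k evaluation (not the authors' code).
# Assumes `real_batches` and `fake_batches` yield uint8 tensors of
# shape [B, 3, H, W] covering ~30k MS COCO validation captions.
# Requires: pip install torchmetrics torch-fidelity
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid_30k(real_batches, fake_batches, device="cuda"):
    fid = FrechetInceptionDistance(feature=2048).to(device)
    for batch in real_batches:          # reference images from MS COCO val
        fid.update(batch.to(device), real=True)
    for batch in fake_batches:          # images generated from the captions
        fid.update(batch.to(device), real=False)
    return float(fid.compute())
```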
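
For the 'Experiment Setup' row, the quoted 0.2 probability of using an image instead of a caption during finetuning amounts to a simple data-mixing step. The sketch below illustrates one way this could look; `embed_text` and `embed_image` are hypothetical stand-ins for the paper's frozen multimodal encoder, not its actual API.

```python
# Illustrative sketch of the caption/image substitution described above.
# Names and interfaces are assumptions, not the authors' implementation.
import random

P_IMAGE = 0.2  # probability of conditioning on the image instead of the caption

def build_conditioning(example, embed_text, embed_image):
    """Return the conditioning embedding for one training example."""
    if random.random() < P_IMAGE:
        return embed_image(example["image"])   # condition on the image itself
    return embed_text(example["caption"])      # condition on the text caption
```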