Generating Images with Multimodal Language Models

Authors: Jing Yu Koh, Daniel Fried, Russ R. Salakhutdinov

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that GILL is more effective than Stable Diffusion at processing longer-form text, including dialogue and discourse. We show on dialogue-conditioned image generation that GILL can outperform non-LLM based generation models, and benefit from multimodal context: generating images that match text better than the backbone generation models that we distill from.
Researcher Affiliation | Academia | Jing Yu Koh, Carnegie Mellon University, jingyuk@cs.cmu.edu; Daniel Fried, Carnegie Mellon University, dfried@cs.cmu.edu; Ruslan Salakhutdinov, Carnegie Mellon University, rsalakhu@cs.cmu.edu
Pseudocode | No | The paper provides architectural diagrams and mathematical formulations for its methods but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and pretrained models are publicly released at https://github.com/kohjingyu/gill.
Open Datasets | Yes | We train on Conceptual Captions (CC3M) [52], which consists of 3.3M image-text pairs.
Dataset Splits | Yes | We extract the most confident set of these annotations (retaining roughly 900 examples with an inter-annotator agreement of at least 4/5), and split them into a 67% train (600) and 33% test (300) split. Evaluation also uses "FID on the CC3M validation set" and "image captioning on the MS-COCO (2017) validation set". (A split sketch appears after the table.)
Hardware Specification | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs.
Software Dependencies | No | The paper mentions specific models like OPT-6.7B and Stable Diffusion v1.5, and uses bfloat16 precision and the Adam optimizer, but does not specify software versions for libraries like TensorFlow or PyTorch.
Experiment Setup | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. We use bfloat16 precision [1], and optimize using Adam [30] (β1 = 0.9, β2 = 0.95) with a learning rate of 0.001. We use k = 4 visual tokens, and r = 8 learnt [IMG] tokens. We set the GILLMapper query embedding dimension m = 512. For retrieval, we use an embedding dimension p = 256. (A configuration sketch appears after the table.)
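
The 67%/33% split quoted in the Dataset Splits row is plain arithmetic over roughly 900 high-agreement examples: 600 train and 300 test is a 2:1 ratio. A minimal sketch of such a deterministic split, assuming the filtered annotations are already collected in a list; the function name and seed are illustrative, not taken from the paper or the released code:

```python
import random

def split_annotations(examples, train_frac=2 / 3, seed=0):
    """Deterministically split filtered annotations into train/test subsets.

    The paper reports a 67% train (600) / 33% test (300) split over ~900
    high-agreement examples, i.e. a 2:1 ratio, which train_frac=2/3 reproduces.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible (illustrative choice)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

# Example with 900 placeholder annotation IDs -> 600 train / 300 test.
train, test = split_annotations(range(900))
print(len(train), len(test))
```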
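
The Hardware Specification and Experiment Setup rows give enough numeric detail to restate the optimizer and model-size settings as a configuration. Below is a minimal PyTorch sketch of that configuration; the `model` module, loss, and training-step wrapper are placeholders and do not reflect how the released GILL code is organized, only the quoted hyperparameters are from the paper:

```python
import torch

# Hyperparameters quoted in the table above.
config = {
    "batch_size": 200,
    "iterations": 20_000,
    "learning_rate": 1e-3,
    "adam_betas": (0.9, 0.95),
    "num_visual_tokens_k": 4,       # k = 4 visual tokens
    "num_img_tokens_r": 8,          # r = 8 learnt [IMG] tokens
    "gillmapper_query_dim_m": 512,  # GILLMapper query embedding dimension
    "retrieval_embed_dim_p": 256,   # retrieval embedding dimension
}

# Placeholder module standing in for the trainable GILL parameters.
model = torch.nn.Linear(config["gillmapper_query_dim_m"], config["retrieval_embed_dim_p"])

# Adam with the betas and learning rate reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=config["learning_rate"],
    betas=config["adam_betas"],
)

def train_step(batch):
    """One optimization step using bfloat16 autocast (one standard way to get bf16 precision)."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch).float().mean()  # stand-in loss purely for illustration
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the numeric settings (batch size, iteration count, learning rate, betas, bfloat16, and the k, r, m, p dimensions) come from the paper; everything else in the sketch is assumed for the sake of a runnable example.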