Generating Images with Multimodal Language Models
Authors: Jing Yu Koh, Daniel Fried, Russ R. Salakhutdinov
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that GILL is more effective than Stable Diffusion at processing longer-form text, including dialogue and discourse. We show on dialogue-conditioned image generation that GILL can outperform non-LLM based generation models, and benefit from multimodal context: generating images that match text better than the backbone generation models that we distill from. |
| Researcher Affiliation | Academia | Jing Yu Koh (Carnegie Mellon University, jingyuk@cs.cmu.edu); Daniel Fried (Carnegie Mellon University, dfried@cs.cmu.edu); Ruslan Salakhutdinov (Carnegie Mellon University, rsalakhu@cs.cmu.edu) |
| Pseudocode | No | The paper provides architectural diagrams and mathematical formulations for its methods but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pretrained models are publicly released at https://github.com/kohjingyu/gill. |
| Open Datasets | Yes | We train on Conceptual Captions (CC3M) [52], which consists of 3.3M image-text pairs. |
| Dataset Splits | Yes | We extract the most confident set of these annotations (retaining roughly 900 examples with an inter-annotator agreement of at least 4/5), and split them into a 67% train (600) and 33% test (300) split. Other evaluations use the CC3M validation set (FID) and the MS-COCO (2017) validation set (image captioning). (See the split sketch after the table.) |
| Hardware Specification | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. |
| Software Dependencies | No | The paper mentions specific models such as OPT-6.7B and Stable Diffusion v1.5, and reports using bfloat16 precision and the Adam optimizer, but does not specify versions for software libraries such as PyTorch. |
| Experiment Setup | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. We use bfloat16 precision [1], and optimize using Adam [30] (β1 = 0.9, β2 = 0.95) with a learning rate of 0.001. We use k = 4 visual tokens, and r = 8 learnt [IMG] tokens. We set the GILLMapper query embedding dimension m = 512. For retrieval, we use an embedding dimension p = 256. (See the configuration sketch after the table.) |
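
The 600/300 split described in the Dataset Splits row is straightforward to reproduce in outline. The sketch below is a minimal illustration, assuming the roughly 900 high-agreement annotations are available as a list; the placeholder data and the fixed seed are assumptions for illustration, not details from the paper.

```python
import random

# Stand-in for the ~900 high-agreement annotated examples (placeholder data).
examples = [f"annotation_{i}" for i in range(900)]

rng = random.Random(0)  # fixed seed is an assumption, for a reproducible split
rng.shuffle(examples)

# 67% train (600) / 33% test (300), as reported in the paper.
train, test = examples[:600], examples[600:]
print(len(train), len(test))  # -> 600 300
```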
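
The Experiment Setup row pins down most of the optimization hyperparameters. Below is a minimal, hedged PyTorch sketch wiring those reported values together; the `model` module, its dimensions, and the placeholder loss are illustrative assumptions, since only the hyperparameter values themselves come from the paper.

```python
import torch
import torch.nn as nn

# Placeholder for the trainable modules (e.g., a GILLMapper-like projection);
# the layer dimensions here are assumptions for illustration only.
model = nn.Sequential(nn.Linear(4096, 512), nn.GELU(), nn.Linear(512, 768))

# Adam with β1 = 0.9, β2 = 0.95 and learning rate 0.001, as reported.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))

NUM_ITERS = 20_000  # 20K iterations (paper: ~2 days on 2 A6000 GPUs)
BATCH_SIZE = 200    # batch size 200

for step in range(NUM_ITERS):
    hidden = torch.randn(BATCH_SIZE, 4096)  # stand-in for LLM hidden states
    # bfloat16 mixed precision, as reported in the paper.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = model(hidden).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

On GPU, `device_type="cuda"` in the autocast context would match the paper's A6000 setup; the CPU variant above keeps the sketch runnable anywhere.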