Generating Images with Multimodal Language Models

Authors: Jing Yu Koh, Daniel Fried, Russ R. Salakhutdinov

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that GILL is more effective than Stable Diffusion at processing longer-form text, including dialogue and discourse. We show on dialogue-conditioned image generation that GILL can outperform non-LLM based generation models, and benefit from multimodal context: generating images that match text better than the backbone generation models that we distill from.
Researcher Affiliation | Academia | Jing Yu Koh, Carnegie Mellon University, jingyuk@cs.cmu.edu; Daniel Fried, Carnegie Mellon University, dfried@cs.cmu.edu; Ruslan Salakhutdinov, Carnegie Mellon University, rsalakhu@cs.cmu.edu
Pseudocode | No | The paper provides architectural diagrams and mathematical formulations for its methods but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and pretrained models are publicly released at https://github.com/kohjingyu/gill.
Open Datasets | Yes | We train on Conceptual Captions (CC3M) [52], which consists of 3.3M image-text pairs.
Dataset Splits | Yes | We extract the most confident set of these annotations (retaining roughly 900 examples with an inter-annotator agreement of at least 4/5), and split them into a 67% train (600) and 33% test (300) split. Evaluation also uses "FID on the CC3M validation set" and "image captioning on the MS-COCO (2017) validation set". (A split sketch appears after the table.)
Hardware Specification | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs.
Software Dependencies | No | The paper mentions specific models like OPT-6.7B and Stable Diffusion v1.5, and uses bfloat16 precision and the Adam optimizer, but does not specify software versions for libraries like TensorFlow or PyTorch.
Experiment Setup | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. We use bfloat16 precision [1], and optimize using Adam [30] (β1 = 0.9, β2 = 0.95) with a learning rate of 0.001. We use k = 4 visual tokens, and r = 8 learnt [IMG] tokens. We set the GILLMapper query embedding dimension m = 512. For retrieval, we use an embedding dimension p = 256. (A configuration sketch appears after the table.)
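
The 67%/33% split quoted in the Dataset Splits row is plain arithmetic over roughly 900 high-agreement examples: 600 train and 300 test is a 2:1 ratio. A minimal sketch of such a deterministic split, assuming the filtered annotations are already collected in a list; the function name and seed are illustrative, not taken from the paper or the released code:

```python
import random

def split_annotations(examples, train_frac=2 / 3, seed=0):
    """Deterministically split filtered annotations into train/test subsets.

    The paper reports a 67% train (600) / 33% test (300) split over ~900
    high-agreement examples, i.e. a 2:1 ratio, which train_frac=2/3 reproduces.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible (illustrative choice)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

# Example with 900 placeholder annotation IDs -> 600 train / 300 test.
train, test = split_annotations(range(900))
print(len(train), len(test))
```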
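
The Hardware Specification and Experiment Setup rows give enough numeric detail to restate the optimizer and model-size settings as a configuration. Below is a minimal PyTorch sketch of that configuration; the `model` module, loss, and training-step wrapper are placeholders and do not reflect how the released GILL code is organized, only the quoted hyperparameters are from the paper:

```python
import torch

# Hyperparameters quoted in the table above.
config = {
    "batch_size": 200,
    "iterations": 20_000,
    "learning_rate": 1e-3,
    "adam_betas": (0.9, 0.95),
    "num_visual_tokens_k": 4,       # k = 4 visual tokens
    "num_img_tokens_r": 8,          # r = 8 learnt [IMG] tokens
    "gillmapper_query_dim_m": 512,  # GILLMapper query embedding dimension
    "retrieval_embed_dim_p": 256,   # retrieval embedding dimension
}

# Placeholder module standing in for the trainable GILL parameters.
model = torch.nn.Linear(config["gillmapper_query_dim_m"], config["retrieval_embed_dim_p"])

# Adam with the betas and learning rate reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=config["learning_rate"],
    betas=config["adam_betas"],
)

def train_step(batch):
    """One optimization step using bfloat16 autocast (one standard way to get bf16 precision)."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch).float().mean()  # stand-in loss purely for illustration
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the numeric settings (batch size, iteration count, learning rate, betas, bfloat16, and the k, r, m, p dimensions) come from the paper; everything else in the sketch is assumed for the sake of a runnable example.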