Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Generating Images with Multimodal Language Models
Authors: Jing Yu Koh, Daniel Fried, Russ R. Salakhutdinov
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that GILL is more effective than Stable Diffusion at processing longer-form text, including dialogue and discourse. We show on dialogue-conditioned image generation that GILL can outperform non-LLM based generation models, and benefit from multimodal context: generating images that match text better than the backbone generation models that we distill from. |
| Researcher Affiliation | Academia | Jing Yu Koh Carnegie Mellon University EMAIL Daniel Fried Carnegie Mellon University EMAIL Ruslan Salakhutdinov Carnegie Mellon University EMAIL |
| Pseudocode | No | The paper provides architectural diagrams and mathematical formulations for its methods but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pretrained models are publicly released at https://github.com/kohjingyu/gill. |
| Open Datasets | Yes | We train on Conceptual Captions (CC3M) [52], which consists of 3.3M image-text pairs. |
| Dataset Splits | Yes | We extract the most confident set of these annotations (retaining roughly 900 examples with an inter-annotator agreement of at least 4/5), and split them into a 67% train (600) and 33% test (300) split. [...] FID on the CC3M validation set [...] image captioning on the MS-COCO (2017) validation set |
| Hardware Specification | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. |
| Software Dependencies | No | The paper mentions specific models like OPT-6.7B and Stable Diffusion v1.5, and uses bfloat16 precision and Adam optimizer, but does not specify software versions for libraries like TensorFlow or PyTorch. |
| Experiment Setup | Yes | We train with a batch size of 200 for 20K iterations, which takes 2 days on 2 A6000 GPUs. We use bfloat16 precision [1], and optimize using Adam [30] (β1 = 0.9, β2 = 0.95) with a learning rate of 0.001. We use k = 4 visual tokens, and r = 8 learnt [IMG] tokens. We set the GILLMapper query embedding dimension m = 512. For retrieval, we use an embedding dimension p = 256. |
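
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object, which also makes it easy to check how much data the reported schedule covers. The following is a minimal sketch, not code from the GILL repository: the class name `GillTrainConfig`, its field names, and the helper functions are hypothetical, and it assumes that the batch size of 200 is the total batch size across both GPUs and that all 3.3M CC3M pairs are available, neither of which is stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GillTrainConfig:
    """Values quoted in the Experiment Setup row.

    Field names are illustrative assumptions, not identifiers
    taken from the GILL codebase.
    """
    batch_size: int = 200           # assumed to be the total batch size
    iterations: int = 20_000
    precision: str = "bfloat16"
    optimizer: str = "adam"
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    learning_rate: float = 1e-3
    num_visual_tokens: int = 4      # k
    num_img_tokens: int = 8         # r, learnt [IMG] tokens
    mapper_query_dim: int = 512     # m, GILLMapper query embedding dimension
    retrieval_embed_dim: int = 256  # p


def examples_seen(cfg: GillTrainConfig) -> int:
    """Number of training examples processed over the full schedule."""
    return cfg.batch_size * cfg.iterations


def approx_epochs(cfg: GillTrainConfig, dataset_size: int) -> float:
    """Approximate number of passes over a dataset of the given size."""
    return examples_seen(cfg) / dataset_size


cfg = GillTrainConfig()
CC3M_PAIRS = 3_300_000  # "3.3M image-text pairs" from the Open Datasets row

print(examples_seen(cfg))                        # 4000000
print(round(approx_epochs(cfg, CC3M_PAIRS), 2))  # 1.21
```

Under these assumptions the schedule processes 4M examples, or roughly 1.2 passes over CC3M. If fewer image-text pairs are actually retrievable, the effective number of epochs would be correspondingly higher.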