Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Authors: Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
Researcher Affiliation | Industry | Google Research, Brain Team, Toronto, Ontario, Canada
Pseudocode | Yes | See Appendix Fig. A.32 for reference pseudocode.
Open Source Code | No | These considerations inform our decision to not release code or a public demo.
Open Datasets | Yes | We train on a combination of internal datasets, with 460M image-text pairs, and the publicly available LAION-400M dataset [64], with 400M image-text pairs.
Dataset Splits | Yes | The COCO [38] validation set is the standard benchmark for evaluating text-to-image models for both the supervised [85, 22] and the zero-shot setting [55, 43]. Consistent with previous works, we report zero-shot FID-30K, for which 30K prompts are drawn randomly from the validation set, and the model samples generated on these prompts are compared with reference images from the full validation set. (A sketch of this FID-30K protocol follows the table.)
Hardware Specification | Yes | We use 256 TPU-v4 chips for our base 64×64 model, and 128 TPU-v4 chips for both super-resolution models.
Software Dependencies | No | The paper mentions using 'Adafactor' and 'Adam' as optimizers and 'JAX' as the machine learning framework, but does not provide specific version numbers for these or other software libraries and dependencies.
Experiment Setup | Yes | Unless specified, we train a 2B parameter model for 64×64 text-to-image synthesis, and 600M and 400M parameter models for 64×64 → 256×256 and 256×256 → 1024×1024 super-resolution respectively. We use a batch size of 2048 and 2.5M training steps for all models. We use Adafactor for our base 64×64 model, because initial comparisons with Adam suggested similar performance with a much smaller memory footprint for Adafactor. For super-resolution models, we use Adam, as we found Adafactor to hurt model quality in our initial ablations. For classifier-free guidance, we joint-train unconditionally via zeroing out the text embeddings with 10% probability for all three models. (Sketches of the conditioning dropout, guidance combination, and optimizer choice follow the table.)
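
The zero-shot FID-30K protocol quoted in the Dataset Splits row is easy to express in code. Below is a minimal sketch, not the authors' evaluation code: it assumes Inception feature matrices for the real and generated images have already been extracted, and computes the standard Fréchet distance between the Gaussians fit to them. The commented helpers (`coco_val_prompts`, `extract_inception_features`, `generate`) are hypothetical names for illustration only.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two sets of Inception features.

    FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Protocol as quoted from the paper: draw 30K prompts at random from the
# COCO validation set, generate samples on those prompts, and compare
# against reference images from the *full* validation set.
# rng = np.random.default_rng(0)
# prompts = rng.choice(coco_val_prompts, size=30_000, replace=False)
# fid_30k = frechet_distance(
#     extract_inception_features(coco_val_images),
#     extract_inception_features(generate(prompts)),
# )
```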
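The classifier-free guidance recipe in the Experiment Setup row has two halves: during training, the text embeddings are zeroed out for 10% of examples so one network learns both conditional and unconditional predictions, and at sampling time the two predictions are combined with a guidance weight w as in the paper, ε̃ = w·ε(z, c) + (1 − w)·ε(z). The following is a minimal JAX sketch under the assumption that `text_emb` has shape (batch, seq_len, dim); the function names are illustrative, not from the paper's code.

```python
import jax
import jax.numpy as jnp


def drop_text_conditioning(rng: jax.Array, text_emb: jax.Array,
                           p_uncond: float = 0.1) -> jax.Array:
    """Zero the full text-embedding sequence for ~10% of the batch.

    This joint-trains the conditional and unconditional models inside a
    single network, as required for classifier-free guidance.
    """
    batch = text_emb.shape[0]
    keep = jax.random.bernoulli(rng, 1.0 - p_uncond, (batch,))
    return text_emb * keep[:, None, None].astype(text_emb.dtype)


def guided_prediction(eps_uncond: jax.Array, eps_cond: jax.Array,
                      w: float) -> jax.Array:
    """Classifier-free guidance: eps~ = w * eps(z, c) + (1 - w) * eps(z)."""
    return w * eps_cond + (1.0 - w) * eps_uncond
```

With w = 1 this reduces to ordinary conditional sampling; the paper's dynamic thresholding (not shown here) is what lets Imagen use large guidance weights without saturation artifacts.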
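The optimizer split described in the same row (Adafactor for the base model, Adam for the super-resolution models) maps directly onto optax. This is a sketch of that configuration only; the learning-rate values are placeholders, not taken from the paper.

```python
import optax

# Base 64x64 model: Adafactor, chosen in the paper for its much smaller
# optimizer-state memory footprint at similar quality to Adam.
base_tx = optax.adafactor(learning_rate=1e-4)  # placeholder value

# Super-resolution models: Adam, since Adafactor hurt sample quality in
# the authors' initial ablations.
sr_tx = optax.adam(learning_rate=1e-4)         # placeholder value
```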