Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Authors: Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves the classifier-free guidance for the text and retrieval conditions to balance text and retrieval alignment. Re-Imagen achieves significant gains in FID score on COCO and WikiImages. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen significantly improves the fidelity of generated images, especially for less frequent entities.
Researcher Affiliation | Industry | Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen — Google Research — {wenhuchen,hexiang,sahariac,wcohen}@google.com
Pseudocode | No | The paper describes the model architecture and processes using natural language and diagrams, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | Considering such potential threats to the public, we will be cautious about code and API release. In future work, we will explore a framework for responsible use that balances the value of external auditing of research with the risks of unrestricted open access, allowing this work to be used in a safe and beneficial way.
Open Datasets | Yes | Re-Imagen achieves significant gains in FID score on COCO and WikiImages. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen significantly improves the fidelity of generated images, especially for less frequent entities.
Dataset Splits | Yes | We randomly sample 30K prompts from the validation set as input to the model. The generated images are compared with the reference images from the full validation set (42K). We randomly sample 22K as our validation set; to perform zero-shot evaluation, we further sample 20K prompts from the dataset as input.
Hardware Specification | Yes | The fine-tuning was run for 200K steps on 64 TPU-v4 chips and completed within two days. The inference speed is 30-40 secs for 4 images on 4 TPU-v4 chips.
Software Dependencies | No | The paper mentions software components like T5 embeddings, BM25, CLIP, ScaNN, Adafactor, and Adam, but it does not specify version numbers for these dependencies.
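For context on the BM25 retrieval criterion mentioned among the paper's components, a minimal pure-Python sketch of the standard Okapi BM25 scoring formula is shown below. The parameter values k1=1.5 and b=0.75 are common defaults, not values reported in the paper, and this toy scorer stands in for the paper's actual retrieval over a large multimodal knowledge base.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    Minimal illustration of the Okapi BM25 formula; k1 and b are
    common defaults, not values taken from the Re-Imagen paper.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

A document containing both query terms outranks one containing only one of them, and documents sharing no terms with the query score zero.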
Experiment Setup | Yes | The guidance weight w for the 64×64 model is swept over [1.0, 1.25, 1.5, 1.75, 2.0], while the 256×256 super-resolution model's guidance weight w is swept over [1.0, 5.0, 8.0, 10.0]. We set the number of neighbors k=2 and set γ=BM25 during training. The fine-tuning was run for 200K steps... We use Adafactor for the 64×64 model and Adam for the 256×256 super-resolution model, with a learning rate of 1e-4. The 64×64 diffusion model runs for 256 diffusion steps under a strong guidance weight of w=30 for both text and neighbor conditions. For the 256×256 and 1024×1024 resolution models, we use constant guidance weights of 5.0 and 3.0, respectively, with 128 and 32 diffusion steps.
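The interleaved classifier-free guidance described above can be sketched as follows. This is an illustrative sketch only: the alternating even/odd schedule, the function name, and the scalar noise predictions are assumptions for clarity, not the paper's exact sampling recipe, which applies guidance to the denoiser's outputs at each diffusion step.

```python
def interleaved_cfg(eps_uncond, eps_text, eps_retr, step, w_text, w_retr):
    """One classifier-free guidance update, alternating conditions.

    Illustrative sketch: on even steps, guide toward the text
    condition; on odd steps, toward the retrieval condition. The
    alternating schedule is an assumption, not the paper's exact
    recipe. Operates elementwise on scalars or arrays.
    """
    if step % 2 == 0:
        return eps_uncond + w_text * (eps_text - eps_uncond)
    return eps_uncond + w_retr * (eps_retr - eps_uncond)
```

Interleaving the two guidance directions, rather than summing them every step, lets the sampler balance text alignment against faithfulness to the retrieved neighbors, which is the trade-off the paper's guidance-weight sweeps explore.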