Zero-Shot Text-to-Image Generation

Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever

ICML 2021

Reproducibility assessment: each variable below lists the extracted result, followed by the LLM response quoting the paper's supporting evidence.

Research Type: Experimental
LLM Response: We evaluate our model zero-shot by comparing it to three prior approaches: AttnGAN (Xu et al., 2018), DM-GAN (Zhu et al., 2019), and DF-GAN (Tao et al., 2020), the last of which reports the best Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017) on MS-COCO.

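As a hedged illustration of the two metrics named above, the sketch below computes Inception Score and FID with the torchmetrics package (installed with the image extra, which pulls in torch-fidelity). The random tensors are stand-ins for real MS-COCO images and model samples; the paper does not say which metric implementation it used.

```python
# Sketch: computing the two metrics named above with torchmetrics.
# Illustrative only; not the paper's evaluation code.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Stand-in uint8 image batches of shape (N, 3, H, W); real code would load
# MS-COCO validation images and model samples here.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```

In practice both metrics need far more than 64 samples for stable estimates; the tiny batches here are only to keep the sketch self-contained.
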
Researcher Affiliation: Industry
LLM Response: OpenAI, San Francisco, California, United States.

Pseudocode: No
LLM Response: No pseudocode or algorithm blocks were found in the paper.

Open Source Code: Yes
LLM Response: https://github.com/openai/DALL-E

Open Datasets: Yes
LLM Response: By comparison, text-to-image generation has typically been evaluated on relatively small datasets such as MS-COCO and CUB-200 (Welinder et al., 2010). Our preliminary experiments for models up to 1.2 billion parameters were carried out on Conceptual Captions, a dataset of 3.3 million text-image pairs that was developed as an extension to MS-COCO (Lin et al., 2014). To scale up to 12 billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet. This dataset does not include MS-COCO, but does include Conceptual Captions and a filtered subset of YFCC100M (Thomee et al., 2016).

Dataset Splits: Yes
LLM Response: We reserved about 606,000 images for validation, and found no signs of overfitting at convergence.

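The paper states the size of the validation holdout (about 606,000 of roughly 250 million pairs) but not how it was drawn. Below is a minimal sketch of one common way to carve out such a holdout, deterministic hashing of a stable example ID; the function name and the hashing scheme are assumptions, not the authors' method.

```python
# Sketch: a deterministic holdout of roughly 606k out of 250M examples.
# The hashing scheme is an assumption; the paper does not describe its split.
import hashlib

VAL_FRACTION = 606_000 / 250_000_000  # about 0.24% of the corpus

def is_validation(example_id: str) -> bool:
    # Hash a stable per-example ID so the split is reproducible across runs
    # and across machines, with no shuffling of the full 250M-pair corpus.
    digest = int(hashlib.md5(example_id.encode("utf-8")).hexdigest(), 16)
    return (digest % 1_000_000) / 1_000_000 < VAL_FRACTION
```
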
Hardware Specification: Yes
LLM Response: On NVIDIA V100 GPUs, this exponent range is specified by five bits. Our 12-billion parameter model consumes about 24 GB of memory when stored in 16-bit precision, which exceeds the memory of a 16 GB NVIDIA V100 GPU.

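The 24 GB figure follows directly from the parameter count: 12 billion parameters at 2 bytes each in 16-bit precision. A quick arithmetic check:

```python
# Back-of-the-envelope check of the ~24 GB figure quoted above.
params = 12_000_000_000
bytes_per_param = 2  # 16-bit precision
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # 24 GB, exceeding a single 16 GB V100
```

Because the parameters alone exceed a single 16 GB V100, the paper shards parameters across the GPUs within each machine rather than replicating the full model per GPU.
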
Software Dependencies: No
LLM Response: The paper mentions general algorithms and methods like Adam and BPE encoding, but does not specify software packages with version numbers (e.g., Python, PyTorch, TensorFlow, or specific libraries with their versions).

Experiment Setup: Yes
LLM Response: Appendix A.2 gives a complete description of the hyperparameters, but we found the following to be especially important for stable training: Specific annealing schedules for the relaxation temperature and step size. ... We also found that increasing the KL weight to β = 6.6 promotes better codebook usage and ultimately leads to a smaller reconstruction error at the end of training. We normalize the cross-entropy losses for the text and image tokens by the total number of each kind in a batch of data. Since we are primarily interested in image modeling, we multiply the cross-entropy loss for the text by 1/8 and the cross-entropy loss for the image by 7/8.

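The loss weighting in the quote above is straightforward to express in code. Below is a minimal sketch assuming PyTorch and illustrative tensor shapes; dalle_loss is a hypothetical helper, not the released training code. F.cross_entropy with its default mean reduction performs exactly the per-token-kind normalization the quote describes.

```python
# Sketch of the loss weighting quoted above: cross-entropy for text and image
# tokens is normalized per token kind, then mixed with weights 1/8 and 7/8.
# Names and shapes are illustrative; this is not the authors' training code.
import torch
import torch.nn.functional as F

def dalle_loss(text_logits: torch.Tensor, text_targets: torch.Tensor,
               image_logits: torch.Tensor, image_targets: torch.Tensor) -> torch.Tensor:
    # logits: (num_tokens, vocab_size); targets: (num_tokens,)
    text_ce = F.cross_entropy(text_logits, text_targets)    # mean over text tokens
    image_ce = F.cross_entropy(image_logits, image_targets) # mean over image tokens
    return (1 / 8) * text_ce + (7 / 8) * image_ce
```

Note that the β = 6.6 KL weight mentioned in the quote applies to the dVAE's ELBO in the first training stage, separately from this transformer cross-entropy objective.
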