Zero-Shot Text-to-Image Generation
Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model zero-shot by comparing it to three prior approaches: AttnGAN (Xu et al., 2018), DM-GAN (Zhu et al., 2019), and DF-GAN (Tao et al., 2020), the last of which reports the best Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017) on MS-COCO. (Both metrics are defined below the table.) |
| Researcher Affiliation | Industry | OpenAI, San Francisco, California, United States. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | https://github.com/openai/DALL-E |
| Open Datasets | Yes | By comparison, text-to-image generation has typically been evaluated on relatively small datasets such as MS-COCO and CUB-200 (Welinder et al., 2010). Our preliminary experiments for models up to 1.2 billion parameters were carried out on Conceptual Captions, a dataset of 3.3 million text-image pairs that was developed as an extension to MS-COCO (Lin et al., 2014). To scale up to 12-billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet. This dataset does not include MS-COCO, but does include Conceptual Captions and a filtered subset of YFCC100M (Thomee et al., 2016). |
| Dataset Splits | Yes | We reserved about 606,000 images for validation, and found no signs of overfitting at convergence. |
| Hardware Specification | Yes | On NVIDIA V100 GPUs, this exponent range is specified by five bits. Our 12-billion parameter model consumes about 24 GB of memory when stored in 16-bit precision, which exceeds the memory of a 16 GB NVIDIA V100 GPU. (A memory arithmetic sketch follows the table.) |
| Software Dependencies | No | The paper mentions general algorithms and methods like Adam and BPE-encoding, but does not specify software packages with version numbers (e.g., Python, PyTorch, TensorFlow, or specific libraries with their versions). |
| Experiment Setup | Yes | Appendix A.2 gives a complete description of the hyperparameters, but we found the following to be especially important for stable training: Specific annealing schedules for the relaxation temperature and step size. ... We also found that increasing the KL weight to β = 6.6 promotes better codebook usage and ultimately leads to a smaller reconstruction error at the end of training. We normalize the cross-entropy losses for the text and image tokens by the total number of each kind in a batch of data. Since we are primarily interested in image modeling, we multiply the cross-entropy loss for the text by 1/8 and the cross-entropy loss for the image by 7/8. (A loss-weighting sketch follows the table.) |
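
For context on the Research Type row: the two metrics it cites have standard definitions, shown below. These are the usual formulations from Salimans et al. (2016) and Heusel et al. (2017), not equations reproduced from this paper.

```latex
% Inception Score (IS): exponentiated expected KL divergence between the
% label distribution of a generated image, p(y|x), and the marginal p(y).
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)

% Fréchet Inception Distance (FID): Fréchet distance between Gaussians
% fitted to Inception activations of real (r) and generated (g) images.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

Higher IS and lower FID indicate better sample quality; DF-GAN's reported values on MS-COCO are the baseline the paper compares against zero-shot.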
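The Hardware Specification row's 24 GB figure is just the parameter count times two bytes for 16-bit storage. A minimal arithmetic check (our Python sketch; only the 12-billion-parameter count is from the paper):

```python
# Back-of-the-envelope memory footprint for the model parameters alone,
# assuming 16-bit (2-byte) storage per parameter. Optimizer state and
# activations would add to this; the sketch covers only the weights.
num_params = 12e9        # 12 billion parameters (from the paper)
bytes_per_param = 2      # 16-bit precision = 2 bytes per parameter

total_gb = num_params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # 24 GB -- more than one 16 GB V100 holds
```

This gap between model size and device memory is why the paper resorts to techniques such as per-resblock gradient scaling and parameter sharding during training.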
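The Experiment Setup row's loss description maps directly to code. A minimal sketch, assuming PyTorch, logits of shape (batch, seq_len, vocab), and integer target tokens; the function and argument names are hypothetical, while the per-token-count normalization and the 1/8 and 7/8 weights come from the quote:

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, image_logits, image_targets):
    """Weighted sum of text and image cross-entropy losses.

    Each loss is normalized by the total number of tokens of its kind
    in the batch (reduction="sum" divided by the token count), then
    weighted 1/8 for text and 7/8 for image, since image modeling is
    the primary objective.
    """
    text_ce = F.cross_entropy(
        text_logits.flatten(0, 1),   # (batch * seq_len, vocab)
        text_targets.flatten(),      # (batch * seq_len,)
        reduction="sum",
    ) / text_targets.numel()
    image_ce = F.cross_entropy(
        image_logits.flatten(0, 1),
        image_targets.flatten(),
        reduction="sum",
    ) / image_targets.numel()
    return (1 / 8) * text_ce + (7 / 8) * image_ce
```

Note that the KL weight β = 6.6 mentioned in the same row belongs to the dVAE training stage, not to this transformer cross-entropy objective.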