Zero-Shot Text-to-Image Generation

Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever

ICML 2021

Reproducibility assessment: each variable below lists the extracted result, followed by the LLM response quoting the paper's supporting evidence.

Research Type: Experimental
LLM Response: We evaluate our model zero-shot by comparing it to three prior approaches: AttnGAN (Xu et al., 2018), DM-GAN (Zhu et al., 2019), and DF-GAN (Tao et al., 2020), the last of which reports the best Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017) on MS-COCO.

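As a hedged illustration of the two metrics named above, the sketch below computes Inception Score and FID with the torchmetrics package (installed with the image extra, which pulls in torch-fidelity). The random tensors are stand-ins for real MS-COCO images and model samples; the paper does not say which metric implementation it used.

```python
# Sketch: computing the two metrics named above with torchmetrics.
# Illustrative only; not the paper's evaluation code.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Stand-in uint8 image batches of shape (N, 3, H, W); real code would load
# MS-COCO validation images and model samples here.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```

In practice both metrics need far more than 64 samples for stable estimates; the tiny batches here are only to keep the sketch self-contained.
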
Researcher Affiliation: Industry
LLM Response: OpenAI, San Francisco, California, United States.

Pseudocode: No
LLM Response: No pseudocode or algorithm blocks were found in the paper.

Open Source Code: Yes
LLM Response: https://github.com/openai/DALL-E

Open Datasets: Yes
LLM Response: By comparison, text-to-image generation has typically been evaluated on relatively small datasets such as MS-COCO and CUB-200 (Welinder et al., 2010). Our preliminary experiments for models up to 1.2 billion parameters were carried out on Conceptual Captions, a dataset of 3.3 million text-image pairs that was developed as an extension to MS-COCO (Lin et al., 2014). To scale up to 12 billion parameters, we created a dataset of a similar scale to JFT-300M (Sun et al., 2017) by collecting 250 million text-image pairs from the internet. This dataset does not include MS-COCO, but does include Conceptual Captions and a filtered subset of YFCC100M (Thomee et al., 2016).

Dataset Splits: Yes
LLM Response: We reserved about 606,000 images for validation, and found no signs of overfitting at convergence.

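The paper states the size of the validation holdout (about 606,000 of roughly 250 million pairs) but not how it was drawn. Below is a minimal sketch of one common way to carve out such a holdout, deterministic hashing of a stable example ID; the function name and the hashing scheme are assumptions, not the authors' method.

```python
# Sketch: a deterministic holdout of roughly 606k out of 250M examples.
# The hashing scheme is an assumption; the paper does not describe its split.
import hashlib

VAL_FRACTION = 606_000 / 250_000_000  # about 0.24% of the corpus

def is_validation(example_id: str) -> bool:
    # Hash a stable per-example ID so the split is reproducible across runs
    # and across machines, with no shuffling of the full 250M-pair corpus.
    digest = int(hashlib.md5(example_id.encode("utf-8")).hexdigest(), 16)
    return (digest % 1_000_000) / 1_000_000 < VAL_FRACTION
```
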
Hardware Specification: Yes
LLM Response: On NVIDIA V100 GPUs, this exponent range is specified by five bits. Our 12-billion parameter model consumes about 24 GB of memory when stored in 16-bit precision, which exceeds the memory of a 16 GB NVIDIA V100 GPU.

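The 24 GB figure follows directly from the parameter count: 12 billion parameters at 2 bytes each in 16-bit precision. A quick arithmetic check:

```python
# Back-of-the-envelope check of the ~24 GB figure quoted above.
params = 12_000_000_000
bytes_per_param = 2  # 16-bit precision
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # 24 GB, exceeding a single 16 GB V100
```

Because the parameters alone exceed a single 16 GB V100, the paper shards parameters across the GPUs within each machine rather than replicating the full model per GPU.
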
Software Dependencies: No
LLM Response: The paper mentions general algorithms and methods like Adam and BPE encoding, but does not specify software packages with version numbers (e.g., Python, PyTorch, TensorFlow, or specific libraries with their versions).

Experiment Setup: Yes
LLM Response: Appendix A.2 gives a complete description of the hyperparameters, but we found the following to be especially important for stable training: Specific annealing schedules for the relaxation temperature and step size. ... We also found that increasing the KL weight to β = 6.6 promotes better codebook usage and ultimately leads to a smaller reconstruction error at the end of training. We normalize the cross-entropy losses for the text and image tokens by the total number of each kind in a batch of data. Since we are primarily interested in image modeling, we multiply the cross-entropy loss for the text by 1/8 and the cross-entropy loss for the image by 7/8.

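The loss weighting in the quote above is straightforward to express in code. Below is a minimal sketch assuming PyTorch and illustrative tensor shapes; dalle_loss is a hypothetical helper, not the released training code. F.cross_entropy with its default mean reduction performs exactly the per-token-kind normalization the quote describes.

```python
# Sketch of the loss weighting quoted above: cross-entropy for text and image
# tokens is normalized per token kind, then mixed with weights 1/8 and 7/8.
# Names and shapes are illustrative; this is not the authors' training code.
import torch
import torch.nn.functional as F

def dalle_loss(text_logits: torch.Tensor, text_targets: torch.Tensor,
               image_logits: torch.Tensor, image_targets: torch.Tensor) -> torch.Tensor:
    # logits: (num_tokens, vocab_size); targets: (num_tokens,)
    text_ce = F.cross_entropy(text_logits, text_targets)    # mean over text tokens
    image_ce = F.cross_entropy(image_logits, image_targets) # mean over image tokens
    return (1 / 8) * text_ce + (7 / 8) * image_ce
```

Note that the β = 6.6 KL weight mentioned in the quote applies to the dVAE's ELBO in the first training stage, separately from this transformer cross-entropy objective.
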