RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

Authors: Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both image quality and aesthetic appeal.
Researcher Affiliation | Collaboration | Zeyue Xue (The University of Hong Kong, xuezeyue@connect.hku.hk); Guanglu Song (SenseTime Research, songguanglu@sensetime.com); Qiushan Guo (The University of Hong Kong, qsguo@cs.hku.hk); Boxiao Liu (SenseTime Research, liuboxiao@sensetime.com); Zhuofan Zong (SenseTime Research, zongzhuofan@gmail.com); Yu Liu (SenseTime Research, liuyuisanai@gmail.com); Ping Luo (The University of Hong Kong; Shanghai AI Laboratory, pluo@cs.hku.hk)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | More details can be found on a webpage: https://raphael-painter.github.io/
Open Datasets | Yes | The training dataset consists of a subset of LAION-5B [20] and some internal datasets, including 730M text-image pairs in total.
Dataset Splits | No | The paper mentions selecting 30,000 images from the COCO validation set for evaluation, but it does not describe how training, validation, and test splits were created for the overall dataset, nor whether standard splits were followed beyond that selection.
Hardware Specification | Yes | A single model with three billion parameters, trained on 1,000 A100 GPUs for two months.
Software Dependencies | No | The entire model is implemented in PyTorch [24] and trained with the AdamW [25] optimizer. PyTorch is named, but no version number is given, so the software-dependency description is incomplete for reproducibility.
Experiment Setup | Yes | The entire model is implemented in PyTorch [24], and is trained by the AdamW [25] optimizer with a learning rate of 1e-4, a weight decay of 0, a batch size of 2,000... Our experiments reveal that a suitable choice for Tc is 500, ensuring the effective learning of texture information.
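For concreteness, the experiment-setup row above corresponds roughly to the PyTorch sketch below. Only the AdamW hyperparameters (learning rate 1e-4, weight decay 0), the global batch size of 2,000, the 1,000-GPU count, and the Tc = 500 threshold come from the paper; the placeholder module and the even per-GPU batch split are illustrative assumptions, since the authors' training code is not released.

    import torch.nn as nn
    from torch.optim import AdamW

    # Placeholder module standing in for RAPHAEL's ~3B-parameter diffusion
    # network, which is not publicly available.
    model = nn.Linear(1024, 1024)

    # Optimizer settings reported in the paper: AdamW, lr = 1e-4, weight decay = 0.
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)

    # Reported global batch size of 2,000 text-image pairs on 1,000 A100 GPUs;
    # the even per-GPU split is an assumption, not stated in the paper.
    num_gpus = 1_000
    global_batch_size = 2_000
    per_gpu_batch_size = global_batch_size // num_gpus  # = 2 (assumed)

    # Diffusion timestep threshold the paper reports as suitable for learning
    # texture information.
    T_c = 500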