CogView: Mastering Text-to-Image Generation via Transformers

Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
Researcher Affiliation | Collaboration | Tsinghua University; DAMO Academy, Alibaba Group; BAAI. Contact: {dm18@mails, jietang@mail}.tsinghua.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes and models are at https://github.com/THUDM/CogView.
Open Datasets | Yes | At present, the most authoritative machine evaluation metric for general-domain text-to-image generation is the FID on MS COCO, which is not included in our training set. (FID is recalled in a note below the table.) [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. While it mentions evaluating on a 'subset' for testing, it does not detail how the overall dataset was partitioned for training and validation.
Hardware Specification | Yes | We train the model with batch size of 6,144 sequences (6.7 million tokens per batch) for 144,000 steps on 512 V100 GPUs (32GB).
Software Dependencies | No | The paper mentions software components such as Adam (optimizer) and SentencePiece (tokenizer) but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We train the model with batch size of 6,144 sequences (6.7 million tokens per batch) for 144,000 steps on 512 V100 GPUs (32GB). The parameters are updated by Adam with max lr = 3 × 10^-4, β1 = 0.9, β2 = 0.95, weight decay = 4 × 10^-2. The learning rate warms up during the first 2% of steps and decays with cosine annealing [34]. (A configuration sketch for this schedule follows below the table.)
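The hyperparameters quoted in the Experiment Setup row map onto a standard warmup-plus-cosine learning-rate schedule. The snippet below is a minimal sketch in PyTorch, not the authors' released training code: the placeholder model, the linear shape of the warmup, the zero final learning rate, and the use of plain Adam (the quote does not say whether the weight decay is decoupled, AdamW-style) are assumptions made here for illustration.

```python
# Minimal sketch of the quoted optimizer and schedule; assumptions noted above.
import math
import torch

model = torch.nn.Linear(16, 16)         # placeholder standing in for the transformer

total_steps = 144_000                   # "144,000 steps"
warmup_steps = int(0.02 * total_steps)  # "warms up during the first 2% of steps"
max_lr = 3e-4                           # "max lr = 3 x 10^-4"

# Plain Adam with the quoted betas and weight decay; decoupled decay is not
# specified in the quoted setup, so torch.optim.Adam is used here.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.95),
    weight_decay=4e-2,
)

def lr_lambda(step: int) -> float:
    """Linear warmup to max_lr, then cosine annealing toward zero (assumed floor)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, each iteration would call:
#   optimizer.step(); scheduler.step()
```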
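Note on the metric cited in the Open Datasets row: the table does not define FID, but the standard Fréchet Inception Distance compares the Gaussian statistics of Inception features extracted from reference and generated images,

    FID = ||μ_r − μ_g||^2 + Tr(Σ_r + Σ_g − 2 (Σ_r Σ_g)^(1/2)),

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of the real and generated image sets. As stated in the Research Type row, the paper reports this score on the blurred MS COCO dataset.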