CogView: Mastering Text-to-Image Generation via Transformers
Authors: Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E. |
| Researcher Affiliation | Collaboration | Tsinghua University; DAMO Academy, Alibaba Group; BAAI. {dm18@mails, jietang@mail}.tsinghua.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are at https://github.com/THUDM/CogView. |
| Open Datasets | Yes | At present, the most authoritative machine evaluation metric for general-domain text-to-image generation is the FID on MS COCO, which is not included in our training set. [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or sample counts. While it mentions evaluating on a 'subset' for testing, it doesn't detail how the overall dataset was partitioned for training and validation phases. |
| Hardware Specification | Yes | We train the model with batch size of 6,144 sequences (6.7 million tokens per batch) for 144,000 steps on 512 V100 GPUs (32GB). |
| Software Dependencies | No | The paper mentions software components like "Adam" (optimizer) and "SentencePiece" (tokenizer) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We train the model with batch size of 6,144 sequences (6.7 million tokens per batch) for 144,000 steps on 512 V100 GPUs (32GB). The parameters are updated by Adam with max lr = 3 × 10⁻⁴, β1 = 0.9, β2 = 0.95, weight decay = 4 × 10⁻². The learning rate warms up during the first 2% steps and decays with cosine annealing [34]. (A minimal configuration sketch follows the table.) |
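
The quoted setup specifies only the optimizer hyperparameters and schedule shape, not the authors' implementation. Below is a minimal PyTorch sketch, assuming a standard `torch.optim.Adam` (the paper does not say whether decoupled AdamW-style weight decay was used) and a hand-written linear-warmup-plus-cosine schedule; `lr_lambda` and the placeholder model are hypothetical names introduced here for illustration.

```python
# Sketch of the reported optimizer configuration (not the authors' code):
# Adam with max lr = 3e-4, betas = (0.9, 0.95), weight decay = 4e-2,
# linear warmup over the first 2% of 144,000 steps, then cosine annealing.
import math
import torch

TOTAL_STEPS = 144_000
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)  # "warms up during the first 2% steps"
MAX_LR = 3e-4

model = torch.nn.Linear(10, 10)  # placeholder for the CogView transformer

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=MAX_LR,
    betas=(0.9, 0.95),
    weight_decay=4e-2,  # assumption: plain L2-style decay; AdamW not specified
)

def lr_lambda(step: int) -> float:
    """Return a multiplier on MAX_LR: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop skeleton: step the scheduler once per optimizer update.
for step in range(TOTAL_STEPS):
    optimizer.step()
    scheduler.step()
    break  # placeholder; the real loop would also compute the loss and backprop
```

The 6,144-sequence batch and 512-GPU data parallelism reported above are orthogonal to this schedule and would be handled by the distributed training framework.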