CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Authors: Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "The results of machine evaluation are demonstrated in Table 1." |
| Researcher Affiliation | Academia | Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang; Tsinghua University, BAAI; {dm18@mails, jietang@mail}.tsinghua.edu.cn |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | Codes and a demo website will be updated at https://github.com/THUDM/CogView2. |
| Open Datasets | Yes | To compare with previous and concurrent works, we follow the most popular benchmark originating from DALL-E [26]: Fréchet Inception Distance and Inception Score evaluated on MS-COCO [17]. |
| Dataset Splits | Yes | 30,000 captions from the validation set are sampled to evaluate the FID (see the FID sketch after this table). |
| Hardware Specification | Yes | The wall-clock time and FLOPs for a 4,096-token sequence on an A100-40GB GPU with different AR-related methods (see the timing sketch after this table). |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number for it or any other key software dependencies. |
| Experiment Setup | Yes | The model has 6 billion parameters (48 layers, hidden size 3072, 48 attention heads), trained for 300,000 iterations in FP16 with batch size 4,096. The sequence length is 512, consisting of 400 image tokens, 1 separator and up to 111 text tokens (a parameter-count sanity check follows the table). |
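
To make the evaluation protocol in the Open Datasets and Dataset Splits rows concrete, here is a minimal sketch of the standard MS-COCO FID computation, assuming `torchmetrics` is installed. `generate_images` is a hypothetical stand-in for the CogView2 text-to-image pipeline; none of this code comes from the paper.

```python
# Minimal FID-evaluation sketch following the MS-COCO protocol quoted above.
# Assumption: generate_images(captions) is a hypothetical stand-in for the
# model under evaluation and returns uint8 image tensors of shape (N, 3, H, W).
import random

import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def evaluate_fid(coco_val_captions, coco_val_images, generate_images, n=30_000):
    # Sample 30,000 caption/image pairs from the validation set, as in the paper.
    sampled = random.sample(list(zip(coco_val_captions, coco_val_images)), n)
    captions, real_images = zip(*sampled)

    fid = FrechetInceptionDistance(feature=2048)
    # Real statistics come from the ground-truth MS-COCO images.
    fid.update(torch.stack(real_images), real=True)
    # Fake statistics come from the model's generations for the same captions.
    fid.update(generate_images(list(captions)), real=False)
    return fid.compute().item()
```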
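
The Hardware Specification row quotes a wall-clock comparison on an A100-40GB GPU, but the paper does not publish its benchmarking harness. The following is one plausible way to time a 4,096-token forward pass in PyTorch using CUDA events; `model` is a placeholder for any of the compared AR-related methods, and the vocabulary size is an assumed value.

```python
# Hypothetical wall-clock timing sketch for a 4,096-token sequence, in the
# spirit of the paper's A100-40GB comparison; not code from the paper.
import torch


def time_forward(model, seq_len=4096, vocab_size=20_000, n_warmup=3, n_runs=10):
    tokens = torch.randint(vocab_size, (1, seq_len), device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with torch.no_grad():
        for _ in range(n_warmup):  # warm up kernels and the CUDA allocator
            model(tokens)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_runs):
            model(tokens)
        end.record()
        torch.cuda.synchronize()

    return start.elapsed_time(end) / n_runs  # milliseconds per forward pass
```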
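
As a sanity check on the Experiment Setup row, the standard 12·h² per-layer rule of thumb puts a 48-layer, hidden-size-3072 transformer at roughly 5.4B parameters, consistent with the quoted 6B once biases, layer norms, and the full vocabulary are included. The vocabulary size below is an assumed placeholder, not a figure from the paper.

```python
# Back-of-the-envelope parameter count for the quoted configuration
# (48 layers, hidden size 3072). The 12*h^2-per-layer rule of thumb and the
# vocabulary size are approximations, not figures from the paper.
def approx_params(n_layers=48, hidden=3072, vocab=20_000, seq_len=512):
    per_layer = 12 * hidden**2               # attention (4h^2) + MLP (8h^2)
    embeddings = (vocab + seq_len) * hidden  # token + position embeddings
    return n_layers * per_layer + embeddings


print(f"{approx_params() / 1e9:.2f}B parameters")
# ~5.5B, consistent with the paper's reported 6B once biases, layer norms,
# and the full vocabulary are added.
```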