CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Authors: Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 experiments) | "The results of machine evaluation are demonstrated in Table 1."
Researcher Affiliation | Academia | "Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang. Tsinghua University, BAAI. {dm18@mails, jietang@mail}.tsinghua.edu.cn"
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | "Codes and a demo website will be updated at https://github.com/THUDM/CogView2."
Open Datasets | Yes | "To compare with previous and concurrent works, we follow the most popular benchmark originated from DALL-E [26], Fréchet Inception Distances and Inception Scores evaluated on MS-COCO [17]."
Dataset Splits | Yes | "30,000 captions from the validation set are sampled to evaluate the FID." (see the evaluation sketch after this table)
Hardware Specification | Yes | "The wall-clock time and FLOPs for a 4,096 sequence on an A100-40GB GPU with different AR-related methods."
Software Dependencies | No | The paper mentions 'Pytorch' but does not specify a version number for it or any other key software dependencies.
Experiment Setup | Yes | "The model has 6 billion parameters (48 layers, hidden size 3072, 48 attention heads), trained for 300,000 iterations in FP16 with batch size 4,096. The sequence length is 512, consisting of 400 image tokens, 1 separator and up to 111 text tokens." (see the configuration sketch below)
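The numbers quoted in the Experiment Setup row can be collected into a small configuration object for reference. This is a minimal sketch; the class and field names are illustrative assumptions and are not taken from the CogView2 codebase.

```python
from dataclasses import dataclass


# Hypothetical configuration mirroring the hyperparameters quoted in the
# "Experiment Setup" row above; names are illustrative, not CogView2 code.
@dataclass
class CogView2TrainConfig:
    num_layers: int = 48
    hidden_size: int = 3072
    num_attention_heads: int = 48      # ~6 billion parameters in total
    image_tokens: int = 400            # image tokens per sample
    separator_tokens: int = 1          # single separator token
    max_text_tokens: int = 111         # up to 111 text tokens
    train_iterations: int = 300_000
    batch_size: int = 4_096
    precision: str = "fp16"

    @property
    def sequence_length(self) -> int:
        # 400 image tokens + 1 separator + up to 111 text tokens = 512
        return self.image_tokens + self.separator_tokens + self.max_text_tokens


cfg = CogView2TrainConfig()
assert cfg.sequence_length == 512
```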
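The Open Datasets and Dataset Splits rows describe the paper's evaluation protocol: FID computed over 30,000 captions sampled from the MS-COCO validation set. Below is a hedged sketch of such an evaluation loop. It uses torchmetrics' FrechetInceptionDistance, which is an assumption about tooling (the paper does not state which FID implementation it used), and the `caption_image_pairs` / `generate_image` arguments are hypothetical placeholders standing in for the COCO loader and the text-to-image model.

```python
import random

from torchmetrics.image.fid import FrechetInceptionDistance


def evaluate_fid(caption_image_pairs, generate_image, num_samples=30_000, seed=0):
    """Sketch of the MS-COCO FID protocol described in the table above.

    caption_image_pairs: list of (caption, real_image) pairs, where real_image
        is a uint8 tensor of shape [3, H, W] from the COCO validation set.
    generate_image: callable mapping a caption string to a generated uint8
        image tensor of shape [3, H, W] (stands in for the text-to-image model).
    """
    fid = FrechetInceptionDistance(feature=2048)
    rng = random.Random(seed)

    # Sample 30,000 validation captions, as reported in the paper.
    sampled = rng.sample(caption_image_pairs, num_samples)

    for caption, real_image in sampled:
        fake_image = generate_image(caption)
        # torchmetrics expects batched uint8 images of shape [N, 3, H, W].
        fid.update(real_image.unsqueeze(0), real=True)
        fid.update(fake_image.unsqueeze(0), real=False)

    return fid.compute().item()
```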