CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Authors: Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present CogVideo, which is the largest and first open-source pretrained transformer for general text-to-video generation. CogVideo demonstrates state-of-the-art FVD on the UCF101 benchmark. We propose multi-frame-rate training to better align text-clip pairs, which significantly improves generation accuracy, in particular for movements with complex semantics. This training strategy gives CogVideo the capacity to control the intensity of changes during generation. We design dual-channel attention to elegantly and efficiently finetune a pretrained text-to-image generative model for text-to-video generation, avoiding expensive full-parameter pretraining from scratch. EXPERIMENTS: Machine evaluation was conducted on two popular benchmarks for video generation, UCF101 (Soomro et al., 2012) and Kinetics-600 (Carreira et al., 2018). (A hedged sketch of the dual-channel attention design is given below the table.) |
| Researcher Affiliation | Academia | Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang; Tsinghua University; BAAI; {hwy22@mails, dm18@mails, jietang@}.tsinghua.edu.cn |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. The methodology is described through text and diagrams. |
| Open Source Code | Yes | Its codes and model are also publicly available at https://github.com/THUDM/CogVideo. We create an anonymous repository https://anonymous.4open.science/r/CogVideo-anonymous-4148, containing codes for pretraining and inference. As pretraining requires huge computational costs (pretraining CogVideo takes 20 days on 104 A100 GPUs), we also release checkpoints to the public to further ensure reproducibility. |
| Open Datasets | Yes | We pretrain our model on a dataset of 5.4 million text-video pairs with a resolution of 160 × 160 (can be upsampled to 480 × 480 further). The data is mainly crawled from the Internet, where each video has its matching caption. About 30% of the captions are in English, which have been translated into Chinese by machine translation. About 50% of the captions are sentences, while the others are made of phrases. Machine evaluation was conducted on two popular benchmarks for video generation, UCF101 (Soomro et al., 2012) and Kinetics-600 (Carreira et al., 2018). |
| Dataset Splits | No | The paper does not explicitly state the training/validation/test splits with percentages or counts for the pretraining dataset. For UCF101, it mentions "We use class labels as input text and generate samples according to the class distribution during inference. For a fair comparison with previous works, we follow Ge et al. (2022) to resize the original 160 × 160 CogVideo generation to 128 × 128, and evaluate FVD and IS over 2,048 and 10,000 samples respectively." For Kinetics-600, it says "Kinetics-600 contains 600 classes of human action videos, with roughly 350,000 train and 50,000 test videos in total. We use the action category as input text, and finetune CogVideo on the training set for 12,000 iterations with a batch size of 640." (A sketch of the UCF101 evaluation procedure is given below the table.) |
| Hardware Specification | Yes | Pretraining CogVideo takes 20 days on 104 A100 GPUs. |
| Software Dependencies | No | The paper mentions tokenization using icetk (with a link to its GitHub) and optimization by Adam, but does not provide specific version numbers for these or other software libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | The model in stage 1 is first pretrained for 76,000 iterations on video clips with a minimum frame rate of 0.25 fps, then trained for 15,000 iterations with a minimum frame rate of 1 fps. The model in stage 2 is pretrained for 78,500 iterations with frame rates of 2, 4, and 8 fps. Both models are trained in FP16 with batch size = 416, and optimized by Adam with max learning rate = 2 × 10⁻⁴, β1 = 0.9, β2 = 0.95, weight decay = 1 × 10⁻². (A sketch of these optimizer settings is given below the table.) |
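
The dual-channel attention mentioned in the Research Type row is only described in prose in this report, so the following is a minimal PyTorch sketch of the idea: a frozen attention branch inherited from the pretrained text-to-image model (CogView2) runs in parallel with a newly added trainable branch, and a learnable scalar `alpha` mixes the two outputs. The module name `DualChannelAttention`, the use of `nn.MultiheadAttention`, and the zero-initialized `alpha` are assumptions of this sketch, not details confirmed by the paper or the released code.

```python
# Minimal sketch of a dual-channel attention block (assumed design, not the
# authors' implementation): a frozen pretrained attention channel plus a new
# trainable channel, mixed by a learnable scalar alpha per layer.
import torch
import torch.nn as nn


class DualChannelAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # Frozen channel: attention inherited from the pretrained
        # text-to-image transformer (CogView2 in the paper).
        self.spatial_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        for p in self.spatial_attn.parameters():
            p.requires_grad = False
        # Trainable channel: the new attention branch added for video.
        self.temporal_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Learnable mixing factor; zero initialization is an assumption of this
        # sketch so training starts close to the pretrained image model.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence of text and frame tokens, hidden_size)
        spatial_out, _ = self.spatial_attn(x, x, x)
        temporal_out, _ = self.temporal_attn(x, x, x)
        return (1.0 - self.alpha) * spatial_out + self.alpha * temporal_out


if __name__ == "__main__":
    block = DualChannelAttention(hidden_size=64, num_heads=4)
    tokens = torch.randn(2, 16, 64)  # toy batch of token embeddings
    print(block(tokens).shape)       # torch.Size([2, 16, 64])
```

If `alpha` starts near zero, the finetuned model initially reproduces the pretrained image model's behavior and can gradually learn how much of the new channel to blend in, which is one way to realize the "avoid pretraining from scratch" goal described above.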
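
The UCF101 protocol quoted in the Dataset Splits row (class labels as prompts, 160 × 160 generations resized to 128 × 128, FVD over 2,048 samples) can be sketched as follows. `generate_video` and `compute_fvd` are hypothetical placeholders for the model's sampler and an FVD implementation (e.g. one built on an I3D feature extractor); neither is part of the released CogVideo code, and the real evaluation pipeline may differ.

```python
# Sketch of the UCF101 machine-evaluation preprocessing described in the paper.
# `generate_video` and `compute_fvd` are hypothetical stand-ins supplied by the
# caller; shapes and value ranges are illustrative assumptions.
import random

import torch
import torch.nn.functional as F


def evaluate_ucf101(generate_video, compute_fvd, class_labels, class_probs,
                    real_videos, num_samples=2048):
    fake_videos = []
    for _ in range(num_samples):
        # Sample a class according to the dataset's class distribution and
        # use the label text as the generation prompt.
        label = random.choices(class_labels, weights=class_probs, k=1)[0]
        video = generate_video(label)  # assumed (frames, 3, 160, 160), float
        # Resize each frame from 160x160 to 128x128 for a fair comparison
        # with prior work, as stated in the paper.
        video = F.interpolate(video, size=(128, 128), mode="bilinear",
                              align_corners=False)
        fake_videos.append(video)
    return compute_fvd(torch.stack(fake_videos), real_videos)
```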
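
The optimizer settings quoted in the Experiment Setup row translate directly into a PyTorch training-loop skeleton. The toy model, synthetic data, and `torch.autocast`/`GradScaler` mixed-precision wrapping below are illustrative assumptions; the paper states only Adam with the listed hyperparameters, FP16 training, and a batch size of 416, not the framework details.

```python
# Sketch of the reported training configuration: Adam, max lr 2e-4,
# betas (0.9, 0.95), weight decay 1e-2, FP16 training, batch size 416.
# Model and data here are tiny stand-ins so the loop runs end to end.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(512, 512)  # stand-in for the 9B-parameter transformer
data_loader = DataLoader(TensorDataset(torch.randn(32, 512), torch.randn(32, 512)),
                         batch_size=8)  # the paper uses batch size 416

optimizer = torch.optim.Adam(model.parameters(),
                             lr=2e-4, betas=(0.9, 0.95), weight_decay=1e-2)
use_amp = torch.cuda.is_available()  # FP16 training as in the paper, if a GPU exists
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
device = "cuda" if use_amp else "cpu"
model.to(device)

for tokens, targets in data_loader:
    tokens, targets = tokens.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(tokens), targets)  # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```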