CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE." (Section 4, Experiments) |
| Researcher Affiliation | Industry | Tencent AI Lab |
| Pseudocode | No | Figure 7 shows the "Architecture of CV-VAE" which is a diagram, not pseudocode or an algorithm block. |
| Open Source Code | No | The paper links https://github.com/AILab-CVC/CV-VAE but states: "Code and checkpoints will be released upon the acceptance of this paper." |
| Open Datasets | Yes | We train our CV-VAE model using image datasets including LAION-COCO [9] and Unsplash [23], as well as the video dataset Webvid-10M [3]. |
| Dataset Splits | Yes | We evaluate our CV-VAE on the COCO2017 [21] validation dataset and the Webvid [3] validation dataset which includes 1024 videos. |
| Hardware Specification | Yes | To avoid numerical overflow, we trained CV-VAE using float32 precision, and the training was carried out on 16 A100 GPUs for 200K steps. ... The training was carried out on 16 A100 GPUs for 5K steps. |
| Software Dependencies | No | The paper mentions employing the AdamW optimizer, DeepSpeed Stage 2, gradient checkpointing, and training with float32/bfloat16 precision, but does not provide specific version numbers for software libraries such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | For image datasets, we employ two resolutions, i.e., 256×256 and 512×512. In the case of video datasets, we use two settings of frames and resolutions: 9×256×256 and 17×192×192. The batch sizes for these four settings are 8, 2, 1, and 1, with sampling ratios of 40%, 10%, 25%, and 25%, respectively. We employed the AdamW optimizer [22] with a learning rate of 1e-4 and cosine learning rate decay. |
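
To make the quoted setup concrete, below is a minimal PyTorch sketch of the reported optimization and data-sampling configuration (AdamW at lr 1e-4 with cosine decay, and the four resolution/batch settings with their sampling ratios). The placeholder model, the `sample_setting` helper, and the treatment of images as single-frame clips are illustrative assumptions, not the authors' implementation; the reported DeepSpeed Stage 2 and gradient-checkpointing details are omitted.

```python
import random
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in module for the CV-VAE; the authors' training code is not public,
# so this placeholder is purely illustrative.
model = torch.nn.Conv3d(3, 4, kernel_size=3)

# Reported optimization: AdamW at lr 1e-4 with cosine learning-rate decay,
# trained for 200K steps (float32, 16 A100 GPUs).
TOTAL_STEPS = 200_000
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

# The four reported (frames, height, width) settings with their batch sizes
# and sampling ratios; treating images as single-frame clips is an assumption.
SETTINGS = [
    {"shape": (1, 256, 256),  "batch": 8, "ratio": 0.40},  # image, 256x256
    {"shape": (1, 512, 512),  "batch": 2, "ratio": 0.10},  # image, 512x512
    {"shape": (9, 256, 256),  "batch": 1, "ratio": 0.25},  # video, 9x256x256
    {"shape": (17, 192, 192), "batch": 1, "ratio": 0.25},  # video, 17x192x192
]

def sample_setting():
    """Draw one of the four settings according to its sampling ratio."""
    return random.choices(SETTINGS, weights=[s["ratio"] for s in SETTINGS])[0]

print(sample_setting())  # e.g. {'shape': (1, 256, 256), 'batch': 8, ...}
```

`random.choices` reproduces the per-step sampling ratios directly; the paper does not specify whether sampling is per-step or follows a fixed per-dataloader schedule in the multi-GPU setup, so this is one plausible reading.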