CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE." (Section 4, Experiments)
Researcher Affiliation | Industry | Tencent AI Lab
Pseudocode | No | Figure 7 shows the "Architecture of CV-VAE", which is a diagram, not pseudocode or an algorithm block.
Open Source Code | No | https://github.com/AILab-CVC/CV-VAE — "Code and checkpoints will be released upon the acceptance of this paper."
Open Datasets | Yes | "We train our CV-VAE model using image datasets including LAION-COCO [9] and Unsplash [23], as well as the video dataset Webvid-10M [3]."
Dataset Splits | Yes | "We evaluate our CV-VAE on the COCO2017 [21] validation dataset and the Webvid [3] validation dataset, which includes 1024 videos."
Hardware Specification | Yes | "To avoid numerical overflow, we trained CV-VAE using float32 precision, and the training was carried out on 16 A100 GPUs for 200K steps. ... The training was carried out on 16 A100 GPUs for 5K steps."
Software Dependencies | No | The paper mentions the AdamW optimizer, DeepSpeed stage 2, gradient checkpointing, and float32/bfloat16 training precision, but does not provide version numbers for software libraries such as PyTorch.
Experiment Setup | Yes | "For image datasets, we employ two resolutions, i.e., 256×256 and 512×512. In the case of video datasets, we use two settings of frames and resolutions: 9×256×256 and 17×192×192. The batch sizes for these four settings are 8, 2, 1, and 1, with sampling ratios of 40%, 10%, 25%, and 25%, respectively. We employed the AdamW optimizer [22] with a learning rate of 1e-4 and cosine learning rate decay." (See the configuration sketch after the table.)
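The Experiment Setup row maps directly onto a small training configuration. Below is a minimal sketch in PyTorch, assuming only the four data settings and optimizer hyperparameters quoted from the paper; the `TrainSetting` dataclass, the `pick_setting` helper, and the placeholder model are hypothetical scaffolding, not part of the paper.

```python
import random
from dataclasses import dataclass

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

@dataclass
class TrainSetting:
    """One of the four data settings quoted above (hypothetical container)."""
    frames: int       # 1 denotes a still image
    resolution: int   # square spatial resolution
    batch_size: int
    sample_ratio: float

# The four settings quoted in the Experiment Setup row.
SETTINGS = [
    TrainSetting(frames=1,  resolution=256, batch_size=8, sample_ratio=0.40),
    TrainSetting(frames=1,  resolution=512, batch_size=2, sample_ratio=0.10),
    TrainSetting(frames=9,  resolution=256, batch_size=1, sample_ratio=0.25),
    TrainSetting(frames=17, resolution=192, batch_size=1, sample_ratio=0.25),
]

def pick_setting() -> TrainSetting:
    """Draw one setting per step according to the quoted sampling ratios."""
    return random.choices(SETTINGS, weights=[s.sample_ratio for s in SETTINGS])[0]

# Placeholder module; the real model is the CV-VAE, which is not specified here.
model = torch.nn.Linear(8, 8)

optimizer = AdamW(model.parameters(), lr=1e-4)
# Cosine decay over the 200K steps reported in the Hardware Specification row.
scheduler = CosineAnnealingLR(optimizer, T_max=200_000)
```

This reproduces only the quoted hyperparameters; the loss terms, data pipelines, and model architecture are the paper's own and are not reconstructed here.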
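The Software Dependencies row names DeepSpeed stage 2, gradient checkpointing, and float32/bfloat16 precision without version numbers. The sketch below shows one common way to express those pieces in plain PyTorch, plus a DeepSpeed-style config dict; all concrete values are assumptions, since the paper gives none.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A minimal DeepSpeed-style config dict for ZeRO stage 2. The values are
# assumptions; the paper only says "deepspeed stage 2".
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
}

block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
x = torch.randn(2, 16, requires_grad=True)

# Gradient checkpointing: recompute activations in the backward pass to save memory.
y = checkpoint(block, x, use_reentrant=False)

# Per the quote above, the VAE itself is trained in float32 to avoid overflow;
# bfloat16 autocast is one common way to run the lower-precision fine-tuning.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_bf16 = checkpoint(block, x, use_reentrant=False)
```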