Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo HU, Ying Shan
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE. (Section 4, Experiments) |
| Researcher Affiliation | Industry | Tencent AI Lab |
| Pseudocode | No | Figure 7 shows the "Architecture of CV-VAE" which is a diagram, not pseudocode or an algorithm block. |
| Open Source Code | No | https://github.com/AILab-CVC/CV-VAE. Code and checkpoints will be released upon the acceptance of this paper. |
| Open Datasets | Yes | We train our CV-VAE model using image datasets including LAION-COCO [9] and Unsplash [23], as well as the video dataset Webvid-10M [3]. |
| Dataset Splits | Yes | We evaluate our CV-VAE on the COCO2017 [21] validation dataset and the Webvid [3] validation dataset which includes 1024 videos. |
| Hardware Specification | Yes | To avoid numerical overflow, we trained CV-VAE using float32 precision, and the training was carried out on 16 A100 GPUs for 200K steps. ... The training was carried out on 16 A100 GPUs for 5K steps. |
| Software Dependencies | No | The paper mentions employing the AdamW optimizer, DeepSpeed stage 2, gradient checkpointing techniques, and training with float32/bfloat16 precision, but does not provide specific version numbers for software libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | For image datasets, we employ two resolutions, i.e., 256 × 256 and 512 × 512. In the case of video datasets, we use two settings of frames and resolutions: 9 × 256 × 256 and 17 × 192 × 192. The batch sizes for these four settings are 8, 2, 1, and 1, with sampling ratios of 40%, 10%, 25%, and 25%, respectively. We employed the AdamW optimizer [22] with a learning rate of 1e-4 and cosine learning rate decay. |
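
The Experiment Setup row can be read as a small training configuration. The sketch below is one possible interpretation, not the authors' code: it assumes one of the four settings is drawn independently at each step according to the quoted sampling ratios, and that the cosine decay runs from 1e-4 down to zero over the 200K steps quoted in the Hardware Specification row, with no warmup. The names `Setting`, `pick_setting`, and `cosine_lr` are hypothetical, as is treating an image as a single frame.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    name: str         # hypothetical label, not from the paper
    frames: int       # 1 for images (assumption), 9 or 17 for video
    resolution: int   # square spatial resolution
    batch_size: int   # batch size quoted in the table
    ratio: float      # sampling ratio quoted in the table


# Values taken from the Experiment Setup row.
SETTINGS = [
    Setting("image_256", 1, 256, 8, 0.40),
    Setting("image_512", 1, 512, 2, 0.10),
    Setting("video_9x256", 9, 256, 1, 0.25),
    Setting("video_17x192", 17, 192, 1, 0.25),
]


def pick_setting(rng: random.Random) -> Setting:
    """Draw one setting per step, weighted by its sampling ratio (assumed scheme)."""
    weights = [s.ratio for s in SETTINGS]
    return rng.choices(SETTINGS, weights=weights, k=1)[0]


def cosine_lr(step: int, total_steps: int = 200_000,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr; min_lr = 0 and no warmup are assumptions."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 100_000, 200_000):
        s = pick_setting(rng)
        print(step, s.name, s.batch_size, cosine_lr(step))
```

Under these assumptions the learning rate starts at 1e-4, reaches half that value at the midpoint, and ends at zero. The paper excerpt does not say whether the sampling ratios apply per step, per GPU, or per batch, so the per-step draw here is only one reading.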