VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Authors: Jinxi Xiang, Ricong Huang, Jun Zhang, Guanbin Li, Xiao Han, Yang Wei

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 3 (Experiments), Evaluation: 'We evaluated our model on the MSR-VTT (Xu et al., 2016) and UCF-101 datasets using the FVD (Unterthiner et al., 2018), CLIPSIM (Radford et al., 2021), and IS metrics. In the ablation study, we further employ frame consistency (FC) to assess video continuity by calculating the average CLIP similarity between two successive frames (Wang et al., 2023d; Esser et al., 2023).'
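The frame consistency (FC) metric quoted above is the average CLIP similarity between consecutive frames. A minimal sketch of that averaging step, assuming per-frame CLIP embeddings have already been computed (producing the embeddings requires an actual CLIP model, which is outside this sketch):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def frame_consistency(frame_embeddings):
    """Mean CLIP similarity over successive frame pairs.

    `frame_embeddings` is a hypothetical list of per-frame CLIP
    embedding vectors; the paper's exact pooling is not specified here.
    """
    pairs = list(zip(frame_embeddings, frame_embeddings[1:]))
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)
```

For example, a clip whose frames all map to identical embeddings would score FC = 1.0, while abrupt content changes pull the score toward 0.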
Researcher Affiliation | Collaboration | 1 Tencent AI Lab, Shenzhen, China; 2 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Examples of the videos generated by our model can be found at https://jinxixiang.github.io/versvideo/.' This is a project demo page, not a code repository; there is no explicit statement about releasing source code and no direct link to a code repository for the described methodology.
Open Datasets | Yes | 'In the first stage, we train the text-to-video diffusion model using the WebVid-10M video-text dataset (Bain et al., 2021) and the 300M image-text datasets from LAION-5B (Sohl-Dickstein et al., 2015).'; 'We evaluated our model on the MSR-VTT (Xu et al., 2016) and UCF-101 datasets'; 'Models are trained with a subset of 100,000 videos from the HD-100M (Xue et al., 2021) dataset to avoid watermarks.'
Dataset Splits | No | The paper mentions training and evaluation datasets (WebVid-10M, LAION-5B, MSR-VTT, UCF-101, HD-100M) and refers to the MSR-VTT 'test set', but it does not provide the specific train/validation/test splits (percentages, counts, or citations to predefined splits) needed for reproduction.
Hardware Specification | Yes | Table 4 lists the GPU type as A100 40 GB for both reported configurations.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., PyTorch 1.9 or CUDA 11.1), needed to replicate the experiment.
Experiment Setup | Yes | Table 4: VersVideo network details (including learning rate 5×10⁻⁵, batch size per GPU 16 / 2, diffusion steps 1000, CFG scale 7.5).
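The Table 4 hyperparameters quoted in this row can be collected into a minimal configuration sketch. Field names are assumptions for illustration, not taken from the authors' code, and the per-stage batch sizes follow one plausible reading of the flattened two-column table:

```python
# Hypothetical configuration dict gathering the VersVideo Table 4
# hyperparameters cited in this report; key names are assumed.
VERSVIDEO_CONFIG = {
    "learning_rate": 5e-5,            # reported as 5 x 10^-5
    "batch_size_per_gpu": (16, 2),    # assumed per-stage reading of "16 2"
    "diffusion_steps": 1000,
    "cfg_scale": 7.5,                 # classifier-free guidance scale
    "gpu": "A100 40 GB",
}
```

Even this small fragment shows the report's point: the hyperparameters are recoverable from Table 4, but library versions and exact split definitions are not.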