Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Authors: Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of CMD on popular video generation benchmarks, including UCF-101 (Soomro et al., 2012) and WebVid-10M (Bain et al., 2021). For instance, measured with FVD (Unterthiner et al., 2018; lower is better), our method achieves 238.3 in text-to-video (T2V) generation on WebVid-10M, 18.5% better than the prior state-of-the-art of 292.4. We show the memory and computation efficiency of CMD. For instance, to generate a single video of resolution 512×1024 and length 16, CMD only requires 5.56 GB of memory and 46.83 TFLOPs, while the recent ModelScope (Wang et al., 2023a) requires 8.51 GB of memory and 938.9 TFLOPs, significantly larger than the requirements of CMD (see Figure 1). In Section 4.1, we provide setups for our experiments. In Section 4.2, we present the main results, including qualitative results of visualizing generated videos. Finally, in Section 4.3, we conduct extensive analysis to validate the effect of each component as well as to show the efficiency of CMD in various aspects, compared with previous text-to-video generation methods.
Researcher Affiliation | Collaboration | Sihyun Yu (1), Weili Nie (2), De-An Huang (2), Boyi Li (2, 3), Jinwoo Shin (1), Anima Anandkumar (4); 1 KAIST, 2 NVIDIA Corporation, 3 UC Berkeley, 4 Caltech
Pseudocode | Yes | We summarize the sampling procedure of CMD in Algorithm 1. Algorithm 1: content-motion latent diffusion model (CMD). (A hedged sketch of this sampling procedure is given after the table.)
Open Source Code | No | The paper links to a project page (https://sihyun.me/CMD), which is a demonstration page rather than an explicit statement of code release for the methodology described in the paper, and it is not a direct link to a code repository. The paper also references a third-party implementation: "Our motion diffusion model implementation heavily follows the official implementation of DiT (Peebles & Xie, 2023), including hyperparameters and training objectives used. https://github.com/facebookresearch/DiT", which is not the authors' own code.
Open Datasets | Yes | We mainly consider UCF-101 (Soomro et al., 2012) and WebVid-10M (Bain et al., 2021) for the evaluation. We also use MSR-VTT (Xu et al., 2016) for a zero-shot evaluation of the text-to-video models.
Dataset Splits | Yes | WebVid-10M is a dataset that consists of 10,727,607 text-video pairs in its training split. The dataset also contains a validation split composed of 5,000 text-video pairs. We use the training split for training the model and the validation split for evaluation.
Hardware Specification | Yes | FLOPs and memory consumption are measured with a single NVIDIA A100 40GB GPU to generate a single video of resolution 512×1024 and length 16. All values are measured with a single NVIDIA A100 80GB GPU with mixed precision. We use 8 NVIDIA A100 80GB GPUs for training with a batch size of 24. We use 8 and 32 NVIDIA A100 80GB GPUs to train the model on UCF-101 and WebVid (respectively) and use a batch size of 256. We use 16 and 64 NVIDIA A100 80GB GPUs to train the model on UCF-101 and WebVid (respectively) and use a batch size of 256. (A sketch of how such FLOPs and memory figures can be measured is given after the table.)
Software Dependencies | No | The paper mentions using the Adam optimizer, PyTorch models, and the fvcore library for FLOPs measurement, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | In all experiments, videos are clipped to 16 frames for both training and evaluation. For the video autoencoder, we use TimeSformer (Bertasius et al., 2021) as a backbone. For the content frame diffusion model, we use pretrained Stable Diffusion (SD) 1.5 and 2.1-base (Rombach et al., 2022). For the motion diffusion model, we use DiT-L/2 (for UCF-101) and DiT-XL/2 (for WebVid-10M). We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-5, (β1, β2) = (0.5, 0.9), and no weight decay. We use 8 NVIDIA A100 80GB GPUs for training with a batch size of 24. We use the Adam optimizer with a learning rate of 1e-4, (β1, β2) = (0.9, 0.999), and no weight decay. We use a batch size of 256. We use the DDIM (Song et al., 2021a) sampler. We use η = 0.0 for both models (i.e., without additional random noise in sampling), and we use 100 and 50 sampling steps for the motion diffusion model and the content frame diffusion model, respectively. For the content frame diffusion model, we use the classifier guidance scale w = 4.0 on UCF-101 and w = 7.5 for text-to-video generation. (The optimizer and sampler settings are illustrated in the configuration sketch after the table.)
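Based only on the sampling procedure summarized as Algorithm 1 in the Pseudocode row, here is a minimal, hypothetical PyTorch-style sketch of CMD sampling. The component names (content_frame_diffusion, motion_diffusion, video_decoder) and their interfaces are assumptions made for illustration; they are not the authors' released code.

```python
# Hypothetical sketch of CMD sampling (Algorithm 1).
# Assumes three pretrained components with the illustrative interfaces shown below.
import torch

@torch.no_grad()
def sample_cmd(content_frame_diffusion, motion_diffusion, video_decoder,
               text_prompt, num_frames=16, content_steps=50, motion_steps=100):
    # 1) Generate the content frame latent with the fine-tuned image diffusion
    #    model (DDIM, 50 steps per the quoted setup).
    content_latent = content_frame_diffusion.sample(
        prompt=text_prompt, num_inference_steps=content_steps, eta=0.0)

    # 2) Generate the motion latent with the lightweight DiT-based diffusion
    #    model, conditioned on the content frame and the text prompt (100 steps).
    motion_latent = motion_diffusion.sample(
        condition=(content_latent, text_prompt),
        num_inference_steps=motion_steps, eta=0.0)

    # 3) Decode the content frame and motion latent into a 16-frame video.
    return video_decoder(content_latent, motion_latent, num_frames=num_frames)
```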
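The FLOPs and peak-memory figures quoted under Hardware Specification can in principle be reproduced with the fvcore library and PyTorch's CUDA memory statistics. The sketch below shows one such measurement procedure; `model` and `example_input` are placeholders, since the paper does not provide its exact measurement script.

```python
import torch
from fvcore.nn import FlopCountAnalysis

def profile_model(model, example_input, device="cuda"):
    """Measure FLOPs (via fvcore) and peak GPU memory for one forward pass.
    `model` and `example_input` are placeholders for illustration."""
    model = model.to(device).eval()
    example_input = example_input.to(device)

    # FLOPs for a single forward pass.
    flops = FlopCountAnalysis(model, example_input).total()

    # Peak GPU memory for the same forward pass.
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(example_input)
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3

    return flops / 1e12, peak_mem_gb  # TFLOPs, GB
```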
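For the hyperparameters quoted under Experiment Setup, a minimal configuration sketch in PyTorch/diffusers follows. Only the hyperparameter values come from the paper; the placeholder modules stand in for the real architectures (TimeSformer autoencoder, SD-based content frame model, DiT-based motion model), which are not reproduced here.

```python
import torch
import torch.nn as nn
from diffusers import DDIMScheduler

# Placeholder modules standing in for the CMD autoencoder and diffusion models.
autoencoder = nn.Linear(8, 8)
diffusion_model = nn.Linear(8, 8)

# Autoencoder training: Adam, lr 1e-5, betas (0.5, 0.9), no weight decay.
autoencoder_opt = torch.optim.Adam(
    autoencoder.parameters(), lr=1e-5, betas=(0.5, 0.9), weight_decay=0.0)

# Diffusion-model training: Adam, lr 1e-4, betas (0.9, 0.999), no weight decay.
diffusion_opt = torch.optim.Adam(
    diffusion_model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.0)

# DDIM sampling: 100 steps for the motion diffusion model, 50 steps for the
# content frame diffusion model. eta = 0.0 (no added noise) would be passed
# to scheduler.step() at each denoising step during sampling.
motion_scheduler = DDIMScheduler()
motion_scheduler.set_timesteps(num_inference_steps=100)

content_scheduler = DDIMScheduler()
content_scheduler.set_timesteps(num_inference_steps=50)
```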