Fast and Memory-Efficient Video Diffusion Using Streamlined Inference

Authors: Zheng Zhan, Yushu Wu, Yifan Gong, Zichong Meng, Zhenglun Kong, Changdi Yang, Geng Yuan, Pu Zhao, Wei Niu, Yanzhi Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU (e.g., reducing peak memory of AnimateDiff from 42GB to 11GB, featuring faster inference on 2080Ti). (See the peak-memory measurement sketch after the table.)
Researcher Affiliation | Academia | Zheng Zhan¹, Yushu Wu¹, Yifan Gong¹, Zichong Meng¹, Zhenglun Kong¹,², Changdi Yang¹, Geng Yuan³, Pu Zhao¹, Wei Niu³, Yanzhi Wang¹ (¹Northeastern University, ²Harvard University, ³University of Georgia)
Pseudocode | Yes | Algorithm 1: Key step search in step rehash. (See the illustrative sketch after the table.)
Open Source Code | Yes | Code available at: https://github.com/wuyushuwys/FMEDiffusion
Open Datasets | Yes | Zero-shot UCF-101 [33]: We sample clips from each category of the UCF-101 dataset and gather a subset of 1,000 video clips for evaluation; their action categories are used as their captions. For SVD and SVD-XT, samples are generated at a resolution of 576×1024 (14 frames for SVD, 25 frames for SVD-XT) and then resized to 240×320. For AnimateDiff, we generate samples at resolution 512×512 (16 frames). Zero-shot MSR-VTT [41]: We generated a video sample for each of the 9,940 development prompts. (See the subset-construction sketch after the table.)
Dataset Splits | No | The paper describes the datasets used for evaluation but does not specify explicit training/validation/test splits, percentages, or a clear data-partitioning methodology for reproducibility.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions 'Torch Metrics [1]' and implies the use of PyTorch (via a tutorial link), but it does not provide specific version numbers for these or any other software dependencies. (See the version-recording sketch after the table.)
Experiment Setup | Yes | We use pretrained weights for SVD (I2V) and AnimateDiff (T2V). We compare the proposed Streamlined Inference (using 13 full computation steps) with the original inference (using 25 full computation steps) and the naïve slicing inference mentioned in Sec. 3. (See the baseline inference sketch after the table.)
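
For the Research Type row, the paper's headline numbers (42GB → 11GB peak memory for AnimateDiff) can be checked with PyTorch's built-in CUDA memory counters. The helper below is a minimal sketch, not the authors' measurement code; the function name and the stand-in workload are ours.

```python
import torch

def run_with_peak_memory(fn, *args, **kwargs):
    """Run fn once on the current CUDA device and report peak allocated memory (GiB)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 2**30

# Tiny usage example with a stand-in workload (replace with a diffusion pipeline call):
model = torch.nn.Conv2d(3, 64, 3).cuda()
x = torch.randn(1, 3, 512, 512, device="cuda")
_, peak_gib = run_with_peak_memory(model, x)
print(f"peak memory: {peak_gib:.2f} GiB")
```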
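For the Pseudocode row, the paper's Algorithm 1 searches for "key" denoising steps that must be fully computed, with remaining steps reusing (rehashing) cached features. The details of Algorithm 1 are not quoted here, so the following is only a hypothetical reconstruction of the idea: greedily mark a step as key once its features drift too far from the last key step. The function name, the cosine-similarity criterion, and the threshold value are all our assumptions.

```python
import torch
import torch.nn.functional as F

def search_key_steps(step_features, threshold=0.95):
    """Illustrative greedy key-step selection (hypothetical; not the paper's Algorithm 1).

    step_features: list of per-step feature tensors from a calibration run.
    Returns the indices of steps to compute fully; all other steps would
    reuse the cached features of the most recent key step.
    """
    key_steps = [0]                       # the first step is always fully computed
    ref = step_features[0].flatten()
    for t in range(1, len(step_features)):
        cur = step_features[t].flatten()
        sim = F.cosine_similarity(ref, cur, dim=0)
        if sim < threshold:               # features drifted too far: full computation
            key_steps.append(t)
            ref = cur
    return key_steps
```

Under this sketch, a 0.95 threshold over a 25-step calibration trace would be tuned until roughly 13 key steps remain, matching the budget quoted in the Experiment Setup row.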
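For the Open Datasets row, the 1,000-clip UCF-101 evaluation subset (captions = action categories) can be assembled as below. This is a sketch assuming the standard UCF-101 directory layout (one folder per action category); the function name, sampling scheme, and seed are our assumptions, since the paper does not specify them.

```python
import os
import random

def build_ucf101_subset(root, n_total=1000, seed=0):
    """Sample an evaluation subset from UCF-101, using the category folder name as the caption."""
    rng = random.Random(seed)
    categories = sorted(d for d in os.listdir(root)
                        if os.path.isdir(os.path.join(root, d)))
    per_cat = max(1, n_total // len(categories))   # UCF-101 has 101 categories
    subset = []
    for cat in categories:
        clips = sorted(os.listdir(os.path.join(root, cat)))
        for clip in rng.sample(clips, min(per_cat, len(clips))):
            subset.append({"video": os.path.join(root, cat, clip), "caption": cat})
    return subset[:n_total]
```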
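For the Software Dependencies row, since the paper pins no versions, anyone reproducing the results should record the exact versions of their own stack. The snippet below does this with the standard library; the package list is an assumption about the likely evaluation stack, not one documented by the paper.

```python
from importlib.metadata import version, PackageNotFoundError

# Record exact library versions for reproducibility; package names are assumed.
for pkg in ("torch", "torchmetrics", "diffusers", "transformers"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```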
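For the Experiment Setup row, a baseline SVD (I2V) run with the 25-step budget can be reproduced with Hugging Face diffusers as sketched below. The model ID, input image, and arguments are assumptions about a standard setup, not the authors' released code. Note that simply lowering num_inference_steps to 13 is not the paper's step rehash (which reuses cached features at non-key steps); the snippet only shows where the 25- vs. 13-step budgets in the table come from.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

# Baseline SVD image-to-video inference (assumed setup, fp16 on a single GPU).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = load_image("conditioning_frame.png")   # hypothetical conditioning frame
# Original inference: 25 full denoising steps, 14 frames at 576x1024.
frames = pipe(image, num_frames=14, num_inference_steps=25).frames[0]
```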