Fast and Memory-Efficient Video Diffusion Using Streamlined Inference
Authors: Zheng Zhan, Yushu Wu, Yifan Gong, Zichong Meng, Zhenglun Kong, Changdi Yang, Geng Yuan, Pu Zhao, Wei Niu, Yanzhi Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU (e.g., reducing peak memory of AnimateDiff from 42GB to 11GB, featuring faster inference on 2080Ti). |
| Researcher Affiliation | Academia | Zheng Zhan¹, Yushu Wu¹, Yifan Gong¹, Zichong Meng¹, Zhenglun Kong¹,², Changdi Yang¹, Geng Yuan³, Pu Zhao¹, Wei Niu³, Yanzhi Wang¹ (¹Northeastern University, ²Harvard University, ³University of Georgia) |
| Pseudocode | Yes | Algorithm 1: Key step search in step rehash |
| Open Source Code | Yes | Code available at: https://github.com/wuyushuwys/FMEDiffusion |
| Open Datasets | Yes | Zero-shot UCF-101 [33]: We sample clips from each category of the UCF-101 dataset and gather a subset of 1,000 video clips for evaluation, using the action categories as their captions. For SVD and SVD-XT, samples are generated at a resolution of 576×1024 (14 frames for SVD and 25 frames for SVD-XT) and then resized to 240×320. For AnimateDiff, we generate samples at resolution 512×512 (16 frames). Zero-shot MSR-VTT [41]: We generated a video sample for each of the 9,940 development prompts. |
| Dataset Splits | No | The paper describes the datasets used for evaluation but does not specify explicit training/validation/test splits, percentages, or a clear methodology for data partitioning for reproducibility. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions 'Torch Metrics [1]' and implies the use of PyTorch (via a tutorial link), but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We use pretrained weights for SVD (I2V) and AnimateDiff (T2V). We compare the proposed Streamlined Inference (using 13 full computation steps) with the original inference (using 25 full computation steps) and the naïve slicing inference mentioned in Sec. 3. |
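
To put the "Research Type" and "Experiment Setup" rows in context, below is a minimal sketch of how one might reproduce the 25-step SVD (I2V) baseline and record peak GPU memory using Hugging Face `diffusers`. The checkpoint name, conditioning image, and parameters are standard `diffusers` usage chosen here for illustration; they are not taken from the paper's released code.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load pretrained SVD (I2V) weights; fp16 keeps the UNet within a single GPU.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16
).to("cuda")

image = load_image("conditioning_frame.png")  # hypothetical conditioning image

# Reset allocator statistics so max_memory_allocated reflects this run only.
torch.cuda.reset_peak_memory_stats()

# Original inference baseline: 25 full computation steps, 14 frames at 576x1024.
frames = pipe(
    image, num_inference_steps=25, num_frames=14, height=576, width=1024
).frames[0]

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.1f} GB")
export_to_video(frames, "svd_baseline.mp4", fps=7)
```

The same measurement wrapped around the Streamlined Inference code from the linked repository would yield the before/after peak-memory comparison quoted in the first table row.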
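
The "Pseudocode" row only names Algorithm 1 (key step search in step rehash); the paper's actual procedure is not reproduced in this report. Purely as a speculative sketch, assuming step rehash reuses cached features between denoising steps whose activations are similar, a key-step search might look like the following. The helper name `key_step_search`, the cosine-distance criterion, and the greedy selection are all assumptions, not the paper's specification.

```python
import torch

def key_step_search(features, num_key_steps):
    """Speculative sketch, not the paper's Algorithm 1: select the denoising
    steps whose recorded activations drift most from their predecessor, so
    the remaining steps can reuse (rehash) cached features. `features[t]` is
    assumed to be a flattened activation tensor recorded at step t on
    calibration prompts."""
    T = len(features)
    drift = []
    for t in range(1, T):
        # cosine distance between consecutive steps' features
        d = 1.0 - torch.cosine_similarity(features[t], features[t - 1], dim=0).item()
        drift.append((d, t))
    # step 0 always computes fully; add the steps with the largest drift
    key_steps = [0] + [t for _, t in sorted(drift, reverse=True)[: num_key_steps - 1]]
    return sorted(key_steps)

# e.g., pick 13 full-computation steps out of 25, matching the setup row:
# key_steps = key_step_search([torch.randn(1024) for _ in range(25)], 13)
```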
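
The "Open Datasets" row describes assembling a 1,000-clip zero-shot UCF-101 subset with action categories used as captions. A minimal sketch of that sampling, assuming the standard UCF-101 directory layout (one folder of `.avi` clips per action category); the function name, paths, and per-category quota are illustrative, as the paper does not publish its sampling script.

```python
import random
from pathlib import Path

def build_ucf101_subset(root, total_clips=1000, seed=0):
    """Sample roughly evenly across the 101 action categories, using each
    category name as the clip's caption, as described in the paper."""
    rng = random.Random(seed)
    categories = sorted(p for p in Path(root).iterdir() if p.is_dir())
    per_class = -(-total_clips // len(categories))  # ceil: ~10 clips each
    subset = []
    for cat in categories:
        clips = sorted(cat.glob("*.avi"))
        for clip in rng.sample(clips, min(per_class, len(clips))):
            subset.append({"video": str(clip), "caption": cat.name})
    rng.shuffle(subset)
    return subset[:total_clips]  # trim to exactly 1,000 clips
```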