Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
Authors: Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data. ... We conduct extensive experiments, demonstrating that our MCM significantly improves video diffusion distillation performance. Furthermore, when leveraging an additional image dataset, our MCM better aligns the appearance of the generated video with the high-quality image dataset. |
| Researcher Affiliation | Collaboration | State University of New York at Buffalo; Microsoft. {yzhai6,doermann,jsyuan}@buffalo.edu; {keli,zhengyang,lindsey.li,jianfw,chungching.lin,lijuanw}@microsoft.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'Our model, code, and data will be made public available upon acceptance.' This indicates future release, not concrete access at the time of publication. |
| Open Datasets | Yes | We choose two text-to-video diffusion models for experiments: ModelScope T2V [62] and AnimateDiff [19] with Stable Diffusion v1.5 [45]. We use WebVid-2M [5] as both the video and image training dataset, without using any additional image datasets. |
| Dataset Splits | No | The paper mentions training, validation, and test sets (e.g., 'we randomly sample 500 validation videos from WebVid-2M (WebVid mini) for in-distribution evaluation; we also follow common practice [62, 16] to use 2900 validation videos from MSR-VTT [70] for zero-shot generation evaluation.'), but it does not specify exact split percentages or absolute counts for the training/validation/test partition in a way that would allow the data partitioning to be reproduced directly (see the sampling sketch after the table). |
| Hardware Specification | Yes | The experiments are conducted on a machine equipped with 32 H100 GPUs. |
| Software Dependencies | No | The paper lists software like 'PyTorch' [4], 'Diffusers' [60], and 'PEFT' [40] but does not specify their version numbers. |
| Experiment Setup | Yes | The learning rates for the diffusion model and discriminator are set to 5e-6 and 5e-5, respectively, with batch size 128, Adam optimizer [30], and 30k training steps. The weight hyperparameters are determined via a grid search: λ_adv = 1 and λ_real = 0.5. (A configuration sketch follows the table.) |
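
The reported experiment setup maps directly onto a standard PyTorch configuration. Below is a minimal sketch, assuming placeholder `student_unet` and `discriminator` modules that stand in for the authors' unreleased architectures; only the optimizer choice, learning rates, batch size, step count, and loss weights come from the paper, while every module and variable name here is illustrative.

```python
import torch

# Placeholder modules standing in for the distilled student diffusion
# model and the adversarial discriminator; NOT the authors' code.
student_unet = torch.nn.Linear(8, 8)   # stand-in for the student U-Net
discriminator = torch.nn.Linear(8, 1)  # stand-in for the discriminator

# Hyperparameters as reported in the paper.
optimizer_g = torch.optim.Adam(student_unet.parameters(), lr=5e-6)
optimizer_d = torch.optim.Adam(discriminator.parameters(), lr=5e-5)
batch_size = 128
num_train_steps = 30_000

# Loss weights determined via grid search in the paper.
lambda_adv = 1.0   # weight on the adversarial loss
lambda_real = 0.5  # weight on the real-data loss
```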
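
On the dataset-splits question above: the paper reports sampling 500 validation videos from WebVid-2M (WebVid mini) but not how the sample was drawn. A minimal sketch of one way to make such a sample reproducible, assuming a hypothetical `webvid_val_ids` list of validation video IDs; the seed and list contents are illustrative and not reported in the paper.

```python
import random

# Hypothetical list of WebVid-2M validation video IDs; the paper does not
# publish the exact IDs, so this only illustrates the sampling step.
webvid_val_ids = [f"video_{i:06d}" for i in range(2_000_000)]

rng = random.Random(42)  # illustrative seed; none is reported in the paper
webvid_mini = rng.sample(webvid_val_ids, k=500)  # 500-video WebVid mini set
```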