Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

Authors: Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data. ... We conduct extensive experiments, demonstrating that our MCM significantly improves video diffusion distillation performance. Furthermore, when leveraging an additional image dataset, our MCM better aligns the appearance of the generated video with the high-quality image dataset.'
Researcher Affiliation | Collaboration | State University of New York at Buffalo; Microsoft. Emails: {yzhai6,doermann,jsyuan}@buffalo.edu, {keli,zhengyang,lindsey.li,jianfw,chungching.lin,lijuanw}@microsoft.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Our model, code, and data will be made public available upon acceptance.' This indicates a planned future release, not concrete access at the time of publication.
Open Datasets | Yes | 'We choose two text-to-video diffusion models for experiments: ModelScope T2V [62] and AnimateDiff [19] with Stable Diffusion v1.5 [45]. We use the WebVid-2M [5] as both the video and image training dataset, without using any additional image datasets.'
Dataset Splits | No | The paper mentions training, validation, and test sets (e.g., 'we randomly sample 500 validation videos from WebVid-2M (WebVid mini) for in-distribution evaluation; we also follow common practice [62, 16] to use 2900 validation videos from MSR-VTT [70] for zero-shot generation evaluation.'), but it does not give split percentages or absolute counts for all partitions, so the data partitioning cannot be reproduced directly (see the sampling sketch after the table).
Hardware Specification | Yes | The experiments are conducted on a machine equipped with 32 H100 GPUs.
Software Dependencies | No | The paper lists software such as PyTorch [4], Diffusers [60], and PEFT [40] but does not specify their version numbers (see the version-recording sketch after the table).
Experiment Setup | Yes | 'The learning rates for the diffusion model and discriminator are set to 5e-6 and 5e-5, respectively, with batch size 128, Adam optimizer [30], and 30k training steps. The weight hyperparameters are determined via a grid search: λ_adv = 1 and λ_real = 0.5.' (See the training-loop sketch after the table.)
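
For the Dataset Splits row, a minimal sketch of how the 500-video 'WebVid mini' validation subset could be reconstructed. The metadata file name, column name, and random seed below are assumptions, since the paper reports only the subset size:

```python
# Hypothetical reconstruction of the 'WebVid mini' validation subset: 500
# videos sampled at random from WebVid-2M. The CSV path, column name, and
# seed are assumptions; the paper does not report them.
import random

import pandas as pd

SEED = 0  # assumed; the paper does not report a sampling seed

index = pd.read_csv("webvid_2m_val_metadata.csv")  # hypothetical metadata file
rng = random.Random(SEED)
webvid_mini = rng.sample(index["videoid"].tolist(), 500)

pd.DataFrame({"videoid": webvid_mini}).to_csv("webvid_mini.csv", index=False)
```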
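For the Software Dependencies row, a small script that records the installed versions of the packages the paper names. It pins whatever is present in the current environment, not the (unreported) versions the authors used:

```python
# Print a requirements-style version line for each package the paper names
# (PyTorch, Diffusers, PEFT), so the current environment can be reproduced.
import importlib.metadata as metadata

for pkg in ("torch", "diffusers", "peft"):
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"# {pkg} is not installed")
```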
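For the Experiment Setup row, a minimal PyTorch sketch of the reported optimization configuration: Adam at 5e-6 for the student diffusion model and 5e-5 for the discriminator, batch size 128, 30k steps, and loss weights λ_adv = 1 and λ_real = 0.5. The networks and loss terms are placeholders, not the authors' MCM objective:

```python
# Sketch of the reported training configuration. The two Linear modules and
# the loss terms are placeholders standing in for the student diffusion
# model, the discriminator, and the MCM losses, which the row above does
# not describe in full.
import torch

student = torch.nn.Linear(8, 8)        # placeholder: distilled diffusion model
discriminator = torch.nn.Linear(8, 1)  # placeholder: latent discriminator

opt_student = torch.optim.Adam(student.parameters(), lr=5e-6)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=5e-5)

lambda_adv, lambda_real = 1.0, 0.5     # grid-searched weights from the paper
batch_size, num_steps = 128, 30_000

for step in range(num_steps):
    real = torch.randn(batch_size, 8)  # stand-in for a training batch
    fake = student(real)

    # Student update: placeholder consistency loss plus weighted adversarial term.
    loss = fake.pow(2).mean() + lambda_adv * (-discriminator(fake).mean())
    opt_student.zero_grad()
    loss.backward()
    opt_student.step()

    # Discriminator update: real-data term weighted by lambda_real (placeholder form).
    d_loss = lambda_real * (-discriminator(real).mean()) + discriminator(fake.detach()).mean()
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()
```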