Video Diffusion Models are Training-free Motion Interpreter and Controller

Authors: Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Various experiments showcase the effectiveness of MOFT in controlling the motions of diverse scenarios across different video diffusion models without the need for any training.
Researcher Affiliation | Academia | Zeqi Xiao (1), Yifan Zhou (1), Shuai Yang (2), Xingang Pan (1). 1: S-Lab, Nanyang Technological University; 2: Wangxuan Institute of Computer Technology, Peking University. {zeqi001, yifan006}@e.ntu.edu.sg, williamyang@pku.edu.cn, xingang.pan@ntu.edu.sg
Pseudocode | Yes | Algorithm 1: Optimization Process
Open Source Code | No | Project page at this URL. (The NeurIPS checklist also explicitly states that code is not provided at submission time: “We will release it later.”)
Open Datasets | Yes | We follow [22; 25] that uses an image quality predictor trained on the SPAQ dataset [12] to evaluate frame-wise quality regarding distortion like noise, blur, or over-exposure.
Dataset Splits | No | No explicit training/validation/test split percentages or counts are provided. The paper mentions collecting data (e.g., “270 prompt-motion direction pairs”), but does not detail how it was split into train/validation/test sets.
Hardware Specification | Yes | It takes approximately 3 minutes to generate one sample on an RTX 3090 GPU.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are explicitly mentioned.
Experiment Setup | Yes | Our results are at a resolution of 512x512 and 16 frames unless otherwise specified. We use DDIM with 25 denoising steps for each sample. (...) In practice, the total denoising step is 25. We set t1 = 19, t2 = 18, t3 = 5. (...) In practice, we apply gradient clipping to the first 8 frames. (...) In practice, we choose the top 4% of motion channels.
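For reference, the experiment-setup values quoted above can be gathered into a single configuration object. This is a minimal sketch assuming a Python pipeline; the class and field names (MOFTConfig, grad_clip_frames, motion_channel_ratio, etc.) are illustrative inventions, and only the numeric values come from the paper's reported setup.

from dataclasses import dataclass

@dataclass
class MOFTConfig:
    # Field names are hypothetical; values are those reported in the paper's experiment setup.
    resolution: tuple = (512, 512)      # output resolution (height x width)
    num_frames: int = 16                # frames per generated video
    sampler: str = "DDIM"               # DDIM sampling is used for each sample
    num_denoising_steps: int = 25       # total denoising steps
    t1: int = 19                        # reported timestep boundaries t1, t2, t3
    t2: int = 18
    t3: int = 5
    grad_clip_frames: int = 8           # gradient clipping applied to the first 8 frames
    motion_channel_ratio: float = 0.04  # top 4% of channels selected as motion channels

config = MOFTConfig()
print(config)

Such a sketch is only meant to make the scattered hyperparameters easy to scan in one place; how t1, t2, and t3 are consumed by the guidance schedule is defined by the paper's Algorithm 1, which is not reproduced here.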