Video Diffusion Models are Training-free Motion Interpreter and Controller
Authors: Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Various experiments showcase the effectiveness of MOFT in controlling the motions of diverse scenarios across different video diffusion models without the need for any training. |
| Researcher Affiliation | Academia | Zeqi Xiao¹, Yifan Zhou¹, Shuai Yang², Xingang Pan¹; ¹S-Lab, Nanyang Technological University; ²Wangxuan Institute of Computer Technology, Peking University; {zeqi001, yifan006}@e.ntu.edu.sg, williamyang@pku.edu.cn, xingang.pan@ntu.edu.sg |
| Pseudocode | Yes | Algorithm 1: Optimization Process (a hedged reconstruction of this loop is sketched after the table) |
| Open Source Code | No | Project page at this URL. (The NeurIPS checklist also explicitly states that code is not provided at submission time: “We will release it later.”) |
| Open Datasets | Yes | We follow [22; 25] that uses an image quality predictor trained on the SPAQ dataset [12] to evaluate frame-wise quality regarding distortion like noise, blur, or over-exposure. |
| Dataset Splits | No | No explicit training/validation/test split percentages or counts are provided. The paper mentions collecting data (e.g., “270 prompt-motion direction pairs”), but does not detail how it was split into train/validation/test sets. |
| Hardware Specification | Yes | It takes approximately 3 minutes to generate one sample on an RTX 3090 GPU. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are explicitly mentioned. |
| Experiment Setup | Yes | Our results are at a resolution of 512x512 and 16 frames unless otherwise specified. We use DDIM with 25 denoising steps for each sample. (...) In practice, the total denoising step is 25. We set t1 = 19, t2 = 18, t3 = 5. (...) In practice, we apply gradient clipping to the first 8 frames. (...) In practice, we choose the top 4% of motion channels. (These settings are echoed in the sketches after the table.) |
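
Neither the quoted setup nor Algorithm 1 comes with released code, so the two sketches below are reconstructions, not the authors' implementation. The first illustrates the "top 4% of motion channels" detail from the setup row: a MOFT-style motion feature is obtained by removing content correlation (subtracting the per-video temporal mean of intermediate U-Net features) and keeping only the most motion-relevant channels. The function `extract_moft`, the variance-based channel ranking, and the test tensor shapes are our assumptions; the paper identifies motion channels through its own offline analysis.

```python
import torch


def extract_moft(features: torch.Tensor, channel_ratio: float = 0.04) -> torch.Tensor:
    """Sketch of MOFT-style motion feature extraction.

    features: (frames, channels, height, width) activations taken from one
    U-Net block at one denoising step.
    """
    # Content debiasing: subtract the temporal mean so the residual varies
    # across frames, i.e. encodes motion rather than static appearance.
    motion = features - features.mean(dim=0, keepdim=True)

    # Keep the top fraction of channels (4% in the paper's setting). Here we
    # rank channels by temporal variance as a stand-in criterion; the paper
    # selects motion channels with its own analysis.
    variance = motion.var(dim=0).mean(dim=(-2, -1))  # one score per channel
    k = max(1, int(channel_ratio * motion.shape[1]))
    return motion[:, variance.topk(k).indices]


if __name__ == "__main__":
    feats = torch.randn(16, 1280, 32, 32)  # 16 frames, hypothetical block size
    print(extract_moft(feats).shape)       # torch.Size([16, 51, 32, 32])
```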
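
The second sketch mirrors the shape of Algorithm 1 (Optimization Process) under the quoted hyperparameters: 25 DDIM steps, guidance assumed active between steps t1 = 19 and t3 = 5 (the exact roles of t1, t2, t3 are defined in the paper), and gradient clipping applied to the first 8 frames. `unet_features` and `ddim_step` are placeholder callables standing in for the real pipeline, so this is a schematic, not the authors' method verbatim.

```python
import torch
import torch.nn.functional as F


def moft_guided_sampling(latents, unet_features, ddim_step, extract, moft_ref,
                         num_steps=25, lr=0.01, clip_frames=8, max_grad=1.0):
    """Training-free motion control: nudge latents at selected denoising
    steps so the model's motion features match a reference MOFT."""
    for step in range(num_steps):
        if 5 <= step <= 19:  # assumed guidance window from t1 = 19, t3 = 5
            latents = latents.detach().requires_grad_(True)
            loss = F.mse_loss(extract(unet_features(latents, step)), moft_ref)
            (grad,) = torch.autograd.grad(loss, latents)
            # The paper applies gradient clipping to the first 8 frames.
            grad[:clip_frames] = grad[:clip_frames].clamp(-max_grad, max_grad)
            latents = (latents - lr * grad).detach()
        latents = ddim_step(latents, step)  # ordinary DDIM denoising update
    return latents


if __name__ == "__main__":
    latents = torch.randn(16, 4, 64, 64)   # (frames, channels, h, w), assumed
    # Stand-in components so the sketch runs end to end.
    unet_features = lambda z, t: z * 2.0   # placeholder feature extractor
    ddim_step = lambda z, t: z * 0.99      # placeholder DDIM update
    moft_ref = torch.zeros(16, 4, 64, 64)  # placeholder reference MOFT
    print(moft_guided_sampling(latents, unet_features, ddim_step,
                               lambda f: f, moft_ref).shape)
```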