Video Diffusion Models are Training-free Motion Interpreter and Controller

Authors: Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Various experiments showcase the effectiveness of MOFT in controlling the motions of diverse scenarios across different video diffusion models without the need for any training.
Researcher Affiliation | Academia | Zeqi Xiao (1), Yifan Zhou (1), Shuai Yang (2), Xingang Pan (1). 1: S-Lab, Nanyang Technological University; 2: Wangxuan Institute of Computer Technology, Peking University. {zeqi001, yifan006}@e.ntu.edu.sg, williamyang@pku.edu.cn, xingang.pan@ntu.edu.sg
Pseudocode | Yes | Algorithm 1: Optimization Process
Open Source Code | No | Project page at this URL. (The NeurIPS checklist also explicitly states that code is not provided at submission time: “We will release it later.”)
Open Datasets | Yes | We follow [22; 25] that uses an image quality predictor trained on the SPAQ dataset [12] to evaluate frame-wise quality regarding distortion like noise, blur, or over-exposure.
Dataset Splits | No | No explicit training/validation/test split percentages or counts are provided. The paper mentions collecting data (e.g., “270 prompt-motion direction pairs”), but does not detail how it was split into train/validation/test sets.
Hardware Specification | Yes | It takes approximately 3 minutes to generate one sample on an RTX 3090 GPU.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are explicitly mentioned.
Experiment Setup | Yes | Our results are at a resolution of 512x512 and 16 frames unless otherwise specified. We use DDIM with 25 denoising steps for each sample. (...) In practice, the total denoising step is 25. We set t1 = 19, t2 = 18, t3 = 5. (...) In practice, we apply gradient clipping to the first 8 frames. (...) In practice, we choose the top 4% of motion channels.
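For reference, the experiment-setup values quoted above can be gathered into a single configuration object. This is a minimal sketch assuming a Python pipeline; the class and field names (MOFTConfig, grad_clip_frames, motion_channel_ratio, etc.) are illustrative inventions, and only the numeric values come from the paper's reported setup.

from dataclasses import dataclass

@dataclass
class MOFTConfig:
    # Field names are hypothetical; values are those reported in the paper's experiment setup.
    resolution: tuple = (512, 512)      # output resolution (height x width)
    num_frames: int = 16                # frames per generated video
    sampler: str = "DDIM"               # DDIM sampling is used for each sample
    num_denoising_steps: int = 25       # total denoising steps
    t1: int = 19                        # reported timestep boundaries t1, t2, t3
    t2: int = 18
    t3: int = 5
    grad_clip_frames: int = 8           # gradient clipping applied to the first 8 frames
    motion_channel_ratio: float = 0.04  # top 4% of channels selected as motion channels

config = MOFTConfig()
print(config)

Such a sketch is only meant to make the scattered hyperparameters easy to scan in one place; how t1, t2, and t3 are consumed by the guidance schedule is defined by the paper's Algorithm 1, which is not reproduced here.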