ReVideo: Remake a Video with Motion and Content Control

Authors: Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our ReVideo has promising performance on several accurate video editing applications, i.e., (1) locally changing video content while keeping the motion constant, (2) keeping content unchanged and customizing new motion trajectories, (3) modifying both content and motion trajectories.
Researcher Affiliation | Collaboration | Chong Mou1,2,5, Mingdeng Cao3,4, Xintao Wang3, Zhaoyang Zhang3, Ying Shan3, Jian Zhang1,2,5; 1 School of Electronic and Computer Engineering, Peking University; 2 Peking University Shenzhen Graduate School-Rabbitpre AIGC Joint Research Laboratory; 3 ARC Lab, Tencent PCG; 4 University of Tokyo; 5 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology
Pseudocode | No | The paper describes its methods and architectures verbally and with diagrams, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for this paper is open-sourced at https://github.com/MC-E/ReVideo.
Open Datasets | Yes | In this work, we choose Stable Video Diffusion (SVD) as the base model. Our three training stages are completed on the WebVid [2] dataset, which contains 10 million text-video pairs. (See the SVD loading sketch after this table.)
Dataset Splits | No | The paper mentions training on the WebVid dataset but does not explicitly state the training/validation/test splits, percentages, or sample counts.
Hardware Specification | Yes | These three stages are optimized for 40K, 30K, and 20K iterations, respectively, with Adam [28] optimizer on 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions the Adam [28] optimizer and CoTracker [22] but does not provide specific version numbers for these or other software dependencies such as Python or deep learning frameworks.
Experiment Setup | Yes | Our three training stages are completed on the WebVid [2] dataset, which contains 10 million text-video pairs. These three stages are optimized for 40K, 30K, and 20K iterations, respectively, with Adam [28] optimizer on 4 NVIDIA A100 GPUs. The batch size for each GPU is set as 4, with the resolution being 512 × 320. (See the training-configuration sketch after this table.)
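
The Open Datasets row names Stable Video Diffusion (SVD) as the base model. The snippet below is a minimal sketch of loading a public SVD checkpoint with the Hugging Face diffusers library as a stand-in for the paper's base model; the checkpoint name, fp16 settings, and file names are assumptions rather than details taken from the paper.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load a public SVD checkpoint (assumed stand-in for the paper's base model).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Conditioning image: any RGB frame serving as the first frame of the clip.
image = load_image("first_frame.png")  # hypothetical file name

# Generate a short clip conditioned on the frame and save it.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

This only demonstrates the base image-to-video model; ReVideo's content and motion control modules are not part of the stock pipeline.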
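
The Experiment Setup row reports three training stages of 40K, 30K, and 20K iterations with the Adam optimizer, a per-GPU batch size of 4, and 512 × 320 resolution on 4 A100 GPUs. The following is a minimal single-device sketch of how those settings might be wired up in PyTorch; the stand-in model, random batches, clip length, and learning rate are assumptions, since the excerpt does not specify them.

```python
import torch
from torch.optim import Adam

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in network; the paper fine-tunes the SVD denoising UNet plus its
# control branches, which are not reproduced here.
model = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1).to(device)

# Settings reported in the excerpt.
BATCH_PER_GPU = 4                        # batch size for each GPU
WIDTH, HEIGHT = 512, 320                 # training resolution
STAGE_ITERS = [40_000, 30_000, 20_000]   # three training stages
NUM_FRAMES = 14                          # assumed clip length (not stated)

# Learning rate is not given in the excerpt; 1e-5 is an assumed value.
optimizer = Adam(model.parameters(), lr=1e-5)

for stage, num_iters in enumerate(STAGE_ITERS, start=1):
    for step in range(num_iters):
        # Random tensors stand in for WebVid video clips and targets.
        video = torch.randn(BATCH_PER_GPU, 3, NUM_FRAMES, HEIGHT, WIDTH, device=device)
        target = torch.randn_like(video)

        loss = torch.nn.functional.mse_loss(model(video), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In the paper's setup this loop would run under a distributed wrapper across the 4 A100 GPUs; the sketch keeps it single-device for readability.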