Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control

Authors: Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J. Guibas, Gordon Wetzstein

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments."
Researcher Affiliation | Academia | "1 Stanford University, 2 CUHK"
Pseudocode | Yes | "We provide the pseudo-code of our inference algorithm and detailed mathematical analysis in our supplementary. ... Algorithm 1: Algorithm for arbitrary number of videos generation"
Open Source Code | Yes | "Project page: https://collaborativevideodiffusion.github.io/."
Open Datasets | Yes | "The model is trained with two different datasets: RealEstate10K [68] and WebVid10M [1]..."
Dataset Splits | Yes | "We select 65,000 videos from RealEstate10K [68] and 2,400,000 videos from WebVid10M [1] to train our model. Each data point consists of two videos of 16 frames and their corresponding camera extrinsic and intrinsic parameters. For RealEstate10K, we randomly sample a 31-frame clip from the original video and split it into two videos using the method described in the paper. For WebVid10M, we sample a 16-frame clip, duplicate it to create two videos, and then apply random homography deformations to the second video." (See the data-preparation sketch after the table.)
Hardware Specification | Yes | "All models are trained on 8 NVIDIA A100 GPUs for 100k iterations using an effective batch size of 8."
Software Dependencies | No | The paper mentions software such as AnimateDiff [17], CameraCtrl [18], Stable Diffusion [44], and the Adam optimizer [30], but does not provide specific version numbers for any of them.
Experiment Setup | Yes | "Following [18], we use the Adam optimizer [30] with learning rate 1e-4. During training, we freeze the vanilla parameters from our backbones and optimize only our newly injected layers. We mix the data points from RealEstate10K and WebVid10M under the ratio of 7:3 and train the model in two phases alternately. ... We use the DDIM [51] scheduler with 1000 steps during training and 25 steps during inference." (See the training-setup sketch after the table.)
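
The data-construction recipe in the Dataset Splits row is concrete enough to sketch. Below is a minimal Python illustration of both schemes. The exact split rule for the 31-frame RealEstate10K clips and the homography jitter range are not specified in the quote, so the shared-middle-frame split, the ±5% corner jitter, and the function names are assumptions.

```python
import numpy as np
import cv2

def split_realestate_clip(clip: np.ndarray):
    """Split a 31-frame clip (T, H, W, C) into two 16-frame videos.

    Assumption: the two videos are the first and last 16 frames of the
    clip, sharing the middle frame; the paper's exact split rule is in
    its supplementary and may differ.
    """
    assert clip.shape[0] == 31
    return clip[:16], clip[15:]

def duplicate_and_warp(clip: np.ndarray, jitter: float = 0.05, seed: int = 0):
    """Duplicate a 16-frame clip and warp the copy with a random homography.

    Assumption: one homography (random corner jitter of up to
    `jitter` * image size) is shared across all frames of the second video.
    """
    rng = np.random.default_rng(seed)
    t, h, w, _ = clip.shape
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + rng.uniform(-jitter, jitter, (4, 2)) * [w, h]).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)  # 3x3 projective transform
    warped = np.stack([cv2.warpPerspective(f, H, (w, h)) for f in clip])
    return clip, warped
```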
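
Similarly, the Experiment Setup row pins down the optimization recipe: frozen backbone, trainable injected layers, Adam at learning rate 1e-4, and a DDIM scheduler with 1000 training steps and 25 inference steps. Here is a minimal PyTorch sketch, assuming the injected layers can be identified by a name tag (the `cross_video` substring below is hypothetical):

```python
import torch
from diffusers import DDIMScheduler

def is_injected(name: str) -> bool:
    # Hypothetical predicate: assumes the newly injected cross-video
    # layers carry a recognizable tag in their parameter names.
    return "cross_video" in name

def build_optimizer(model: torch.nn.Module, lr: float = 1e-4):
    """Freeze the pretrained backbone; train only the injected layers."""
    trainable = []
    for name, param in model.named_parameters():
        if is_injected(name):
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)  # vanilla backbone stays frozen
    return torch.optim.Adam(trainable, lr=lr)

# DDIM noise schedule: 1000 diffusion steps at training time,
# subsampled to 25 steps at inference time.
scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=25)
```

The 7:3 dataset mix could be realized with a weighted sampler; the quote does not say how the two-phase alternation is scheduled, so that detail is left out of the sketch.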