Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Authors: Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J. Guibas, Gordon Wetzstein
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. |
| Researcher Affiliation | Academia | 1Stanford University 2CUHK |
| Pseudocode | Yes | We provide the pseudo-code of our inference algorithm and detailed mathematical analysis in our supplementary. ... Algorithm 1: Algorithm for arbitrary number of videos generation |
| Open Source Code | Yes | Project page: https://collaborativevideodiffusion.github.io/. |
| Open Datasets | Yes | The model is trained with two different datasets: RealEstate10K [68] and WebVid-10M [1]... |
| Dataset Splits | Yes | We select 65,000 videos from RealEstate10K [68] and 2,400,000 videos from WebVid-10M [1] to train our model. Each data point consists of two videos of 16 frames and their corresponding camera extrinsic and intrinsic parameters. For RealEstate10K, we randomly sample a 31-frame clip from the original video and split it into two videos using the method described in the paper. For WebVid-10M, we sample a 16-frame clip, duplicate it to create two videos, and then apply random homography deformations to the second video. |
| Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 GPUs for 100k iterations using an effective batch size of 8. |
| Software Dependencies | No | The paper mentions software like AnimateDiff [17], CameraCtrl [18], Stable Diffusion [44], and the Adam optimizer [30], but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Following [18], we use the Adam optimizer [30] with learning rate 1e-4. During training, we freeze the vanilla parameters from our backbones and optimize only our newly injected layers. We mix the data points from RealEstate10K and WebVid-10M under a ratio of 7:3 and train the model in two phases alternately. ... We use the DDIM [51] scheduler with 1000 steps during training and 25 steps during inference. |
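The Dataset Splits row describes the paper's two data-pairing schemes: RealEstate10K clips of 31 frames are split into two 16-frame videos, and WebVid-10M clips are duplicated with a random homography applied to the copy. A minimal sketch of both steps is below; the exact split point and the corner-jitter homography parameterization are our assumptions for illustration (the paper defers these details to its own method and supplementary).

```python
import numpy as np

def split_realestate_clip(clip: np.ndarray):
    """Split a 31-frame clip into two 16-frame videos.

    Assumption: the two videos share the middle frame (frames 0-15
    and 15-30); the paper only says the clip is "split into two
    videos using the method described in the paper".
    """
    assert clip.shape[0] == 31
    return clip[:16], clip[15:]

def random_homography(rng: np.random.Generator, size: int = 256,
                      max_shift: float = 0.05) -> np.ndarray:
    """Sample a random 3x3 homography for the WebVid-10M deformation.

    Illustrative parameterization: jitter the four image corners by up
    to `max_shift` of the image size, then solve the 8-DoF DLT linear
    system mapping source corners to jittered corners.
    """
    src = np.array([[0, 0], [size, 0], [size, size], [0, size]], float)
    dst = src + rng.uniform(-max_shift, max_shift, src.shape) * size
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), likewise for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A), np.array(b))
    return np.append(h, 1.0).reshape(3, 3)  # h22 fixed to 1
```

In a real pipeline the sampled homography would then be applied per frame with an image-warping routine (e.g. an OpenCV perspective warp) to produce the second video of the pair.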