Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts." |
| Researcher Affiliation | Academia | King Abdullah University of Science and Technology |
| Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | "We will release the code and pretrained models upon acceptance." |
| Open Datasets | No | "We construct a dataset named MV-VideoNet that provides 14,271 triples of a multi-view video sequence, its associated camera pose sequence, and a text description." The NeurIPS checklist answers 'NA' to the 'New Assets' question, with the justification "The paper does not release new assets.", indicating that the constructed dataset is not publicly released. (An illustrative sketch of one such triple appears after the table.) |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | "MVDream is trained on 32 NVIDIA Tesla A100 GPUs, which takes 3 days, and AnimateDiff takes around 5 days on 8 A100 GPUs. By combining and reusing the layers of MVDream and AnimateDiff, our method only needs to train the proposed 3D-2D alignment and 2D-3D layers, reducing the training cost to around 2 days with 8 A100 GPUs." Table II (training settings) also lists GPU type NVIDIA A100 and GPU number 8. |
| Software Dependencies | No | The paper mentions 'DDIMScheduler' and 'AdamW' for training and 'Cycles 3' for rendering, but does not provide specific version numbers for these or for other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | "We train our model using AdamW [54] with a learning rate of 10^-4. During training, we process the training data by randomly sampling 4 views that are orthogonal to each other from a multi-view video sequence, reducing the spatial resolution of videos to 256×256, and sampling video frames with a stride of 3. Following AnimateDiff, we use a linear beta schedule with β_start = 0.00085 and β_end = 0.012." Table II (training settings) lists noise-scheduler timesteps 1000, learning rate 0.0001, training steps 100,000, and batch size 16. (A hedged configuration sketch based on these values follows the table.) |
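
For concreteness, the block below is a minimal training-configuration sketch built from the values quoted in the Experiment Setup row. It assumes the HuggingFace diffusers 'DDIMScheduler' and PyTorch 'AdamW' implementations (the paper names these components but does not state library versions), and 'trainable_layers' is a hypothetical placeholder for Vivid-ZOO's unreleased 3D-2D alignment and 2D-3D layers; treat it as an illustration of the reported settings, not the authors' training code.

```python
import torch
from diffusers import DDIMScheduler

# Noise scheduler: 1000 timesteps, linear beta schedule with
# beta_start = 0.00085 and beta_end = 0.012 (values quoted from the paper).
noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="linear",
)

# Hypothetical stand-in for the trainable 3D-2D alignment and 2D-3D layers
# (the actual modules are not released); used here only so the optimizer
# has parameters to manage.
trainable_layers = torch.nn.Linear(8, 8)

# AdamW with the reported learning rate of 1e-4.
optimizer = torch.optim.AdamW(trainable_layers.parameters(), lr=1e-4)

# Remaining settings reported in the experiment setup and Table II.
BATCH_SIZE = 16        # batch size
TRAIN_STEPS = 100_000  # training steps
NUM_VIEWS = 4          # orthogonal views sampled per multi-view video
RESOLUTION = 256       # spatial resolution (256 x 256)
FRAME_STRIDE = 3       # temporal stride when sampling video frames
```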
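
Since MV-VideoNet is described but not released, the following dataclass is only an illustrative sketch of how one triple (multi-view video, camera pose sequence, text description) might be laid out; the field names and array shapes are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultiViewVideoTriple:
    """Illustrative layout of one MV-VideoNet triple (assumed, not official)."""
    # Multi-view RGB video: (num_views, num_frames, height, width, 3).
    video: np.ndarray
    # Camera pose sequence, e.g. 4x4 extrinsics per view and frame:
    # (num_views, num_frames, 4, 4).
    camera_poses: np.ndarray
    # Text description of the animated 3D object.
    caption: str
```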