Vivid-ZOO: Multi-View Video Generation with Diffusion Model
Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts. |
| Researcher Affiliation | Academia | King Abdullah University of Science and Technology |
| Pseudocode | No | The paper describes methods using text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code and pretrained models upon acceptance. |
| Open Datasets | No | We construct a dataset named MV-VideoNet that provides 14,271 triples of a multi-view video sequence, its associated camera pose sequence, and a text description. (The NeurIPS checklist answer for 'New Assets' is 'NA', with the justification 'The paper does not release new assets.', indicating the constructed dataset is not publicly released.) |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits, such as percentages or sample counts for each split. |
| Hardware Specification | Yes | MVDream is trained on 32 Nvidia Tesla A100 GPUs, which takes 3 days, and AnimateDiff takes around 5 days on 8 A100 GPUs. By combining and reusing the layers of MVDream and AnimateDiff, our method only needs to train the proposed 3D-2D alignment and 2D-3D layers, reducing the training cost to around 2 days with 8 A100 GPUs. And Table II: Training settings: GPU type NVIDIA A100; GPU number 8. |
| Software Dependencies | No | The paper mentions 'DDIMScheduler' and 'AdamW' for training, and 'Cycles' for rendering, but does not provide specific version numbers for these or other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | We train our model using AdamW [54] with a learning rate of 10^-4. During training, we process the training data by randomly sampling 4 views that are orthogonal to each other from a multi-view video sequence, reducing the spatial resolution of videos to 256×256, and sampling video frames with a stride of 3. Following AnimateDiff, we use a linear beta schedule with β_start = 0.00085 and β_end = 0.012. And Table II: Training settings: noise scheduler timesteps 1000; learning rate 0.0001; train step number 100,000; batch size 16 (see the configuration sketch below the table). |
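The training settings quoted above map onto a standard latent-diffusion training setup. The sketch below is a hedged, hypothetical reconstruction, not the authors' released code (none is available): it uses the Hugging Face `diffusers` `DDIMScheduler` and PyTorch's `AdamW`, which the paper names, with the hyperparameter values quoted from Table II. The stand-in model, the 8-frame clip length, and the latent downsampling factor of 8 are assumptions introduced purely for illustration.

```python
# Hypothetical sketch of the quoted Vivid-ZOO training configuration.
# Only the numeric hyperparameters come from the paper; everything else is a placeholder.
import torch
from diffusers import DDIMScheduler

# Noise scheduler: linear beta schedule with the quoted values.
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="linear",
)

# Stand-in for the trainable 3D-2D alignment and 2D-3D layers (the real modules are not public).
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

# Optimizer: AdamW with learning rate 1e-4, as quoted.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Quoted data settings: batch size 16, 4 orthogonal views, 256x256 frames, stride 3.
# The clip length (8 frames) and the VAE downsampling factor of 8 are assumptions.
batch_size, num_views, num_frames, resolution = 16, 4, 8, 256
num_train_steps = 100_000  # "train step number 100000"

for step in range(1):  # single demo step; the paper trains for num_train_steps
    # Placeholder batch of multi-view video latents, flattened for the stand-in model.
    latents = torch.randn(batch_size * num_views * num_frames, 4, resolution // 8, resolution // 8)
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))

    # Standard epsilon-prediction diffusion objective.
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    pred = model(noisy_latents)
    loss = torch.nn.functional.mse_loss(pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```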