VideoComposer: Compositional Video Synthesis with Motion Controllability
Authors: Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. |
| Researcher Affiliation | Industry | 1Alibaba Group, 2Ant Group; {xiaolao.wx, yuanhangjie.yhj, zhangjin.zsw}@alibaba-inc.com; {dayou.cdy, wangjiuniu.wjn, yingya.zyy, jingren.zhou}@alibaba-inc.com; {shenyujun0302, zhaodeli}@gmail.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models are publicly available at https://videocomposer.github.io. |
| Open Datasets | Yes | To optimize VideoComposer, we leverage two widely recognized and publicly accessible datasets: WebVid10M [2] and LAION-400M [51]. |
| Dataset Splits | No | The paper mentions using WebVid10M and LAION-400M for training and MSR-VTT for text-to-video generation evaluation, but it does not specify explicit training, validation, and test splits for these datasets with percentages or sample counts. |
| Hardware Specification | No | The paper mentions 'GPUs' being used for training, but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or cloud instance specifications. |
| Software Dependencies | No | The paper mentions using FlashAttention [12] and extending Stable Diffusion 2, but does not provide specific version numbers for these or other software libraries or frameworks required for replication. |
| Experiment Setup | Yes | We adopt AdamW [35] as the default optimizer with a learning rate set to 5 × 10⁻⁵. In total, VideoComposer is pre-trained for 400k steps, with the first and second stage being pre-trained for 132k steps and 268k steps, respectively. We use center crop and randomly sample video frames to compose the video input with F = 16, H = 256 and W = 256. During the second stage pre-training, we adhere to [28], using a probability of 0.1 to keep all conditions, a probability of 0.1 to discard all conditions, and an independent probability of 0.5 to keep or discard a specific condition (see the sketch below the table). |
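
To make the quoted condition-dropout scheme concrete, the following is a minimal Python sketch of how the per-sample keep/drop decision could be drawn. Only the probabilities (0.1 keep all, 0.1 drop all, 0.5 independently per condition) are taken from the paper's setup; the function name and the list of condition names are illustrative assumptions, not the authors' code.

```python
import random

def sample_condition_mask(condition_names,
                          p_keep_all=0.1,
                          p_drop_all=0.1,
                          p_keep_each=0.5):
    """Draw a keep/drop mask over conditions for one training sample.

    Sketch of the dropout scheme described in the paper (following [28]):
    with probability 0.1 keep every condition, with probability 0.1 drop
    every condition, otherwise keep each condition independently with
    probability 0.5.
    """
    r = random.random()
    if r < p_keep_all:
        # Keep all conditions for this sample.
        return {name: True for name in condition_names}
    if r < p_keep_all + p_drop_all:
        # Drop all conditions (unconditional training sample).
        return {name: False for name in condition_names}
    # Otherwise, decide for each condition independently.
    return {name: random.random() < p_keep_each for name in condition_names}

# Hypothetical usage; the condition names here are assumptions for illustration.
conditions = ["text", "sketch", "depth", "motion_vector", "style", "mask"]
mask = sample_condition_mask(conditions)
active_conditions = [name for name, keep in mask.items() if keep]
```

Training with such a mask is what enables the paper's compositional control at inference time: because the model regularly sees samples with arbitrary subsets of conditions (including none), any combination of text, sketch, motion, or reference inputs can be supplied at generation time.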