VideoComposer: Compositional Video Synthesis with Motion Controllability

Authors: Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions.
Researcher Affiliation | Industry | Alibaba Group; Ant Group. {xiaolao.wx, yuanhangjie.yhj, zhangjin.zsw}@alibaba-inc.com; {dayou.cdy, wangjiuniu.wjn, yingya.zyy, jingren.zhou}@alibaba-inc.com; {shenyujun0302, zhaodeli}@gmail.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are publicly available at https://videocomposer.github.io.
Open Datasets | Yes | To optimize VideoComposer, we leverage two widely recognized and publicly accessible datasets: WebVid10M [2] and LAION-400M [51].
Dataset Splits | No | The paper mentions using WebVid10M and LAION-400M for training and MSR-VTT for text-to-video generation evaluation, but it does not specify explicit training, validation, and test splits for these datasets with percentages or sample counts.
Hardware Specification | No | The paper mentions 'GPUs' being used for training, but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or cloud instance specifications.
Software Dependencies | No | The paper mentions using FlashAttention [12] and extending Stable Diffusion 2, but does not provide specific version numbers for these or other software libraries or frameworks required for replication.
Experiment Setup | Yes | We adopt AdamW [35] as the default optimizer with a learning rate set to 5 × 10^-5. In total, VideoComposer is pre-trained for 400k steps, with the first and second stage being pre-trained for 132k steps and 268k steps, respectively. We use center crop and randomly sample video frames to compose the video input whose F = 16, H = 256 and W = 256. During the second stage pre-training, we adhere to [28], using a probability of 0.1 to keep all conditions, a probability of 0.1 to discard all conditions, and an independent probability of 0.5 to keep or discard a specific condition.
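
Since the Experiment Setup row spells out concrete hyperparameters, a minimal PyTorch sketch of how they could be wired up is given below. This is not the authors' released code: the tiny stand-in module and the `conditions` dictionary are illustrative assumptions, and only AdamW, the 5 × 10^-5 learning rate, F = 16, H = 256, W = 256, and the keep/drop probabilities come from the quoted text.

```python
# Minimal sketch (assumed, not the authors' implementation) of the reported
# training setup: AdamW at lr 5e-5, 16-frame 256x256 clips, and the
# second-stage condition-dropping scheme.
import random
import torch
from torch import nn
from torch.optim import AdamW

F, H, W = 16, 256, 256                             # frames and spatial resolution per clip

model = nn.Conv3d(4, 4, kernel_size=3, padding=1)  # stand-in for the video diffusion UNet
optimizer = AdamW(model.parameters(), lr=5e-5)     # learning rate reported in the paper


def drop_conditions(conditions: dict) -> dict:
    """Second-stage condition masking: with p=0.1 keep all conditions,
    with p=0.1 discard all, otherwise keep/discard each one independently with p=0.5."""
    r = random.random()
    if r < 0.1:
        return dict(conditions)                               # keep every condition
    if r < 0.2:
        return {name: None for name in conditions}            # discard every condition
    return {name: (cond if random.random() < 0.5 else None)   # independent per-condition choice
            for name, cond in conditions.items()}


# Example usage with hypothetical condition tensors for one training sample.
conditions = {
    "text": torch.randn(1, 77, 1024),
    "sketch": torch.randn(1, F, 1, H, W),
    "motion": torch.randn(1, F, 2, H, W),
}
masked = drop_conditions(conditions)
```

The per-condition 0.5 keep/drop follows the classifier-free-guidance-style training referenced via [28]; the exact way discarded conditions are represented (zero tensors, learned null embeddings, etc.) is not specified in the quoted text and is left as `None` here.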