VideoComposer: Compositional Video Synthesis with Motion Controllability

Authors: Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions.
Researcher Affiliation | Industry | Alibaba Group; Ant Group. {xiaolao.wx, yuanhangjie.yhj, zhangjin.zsw}@alibaba-inc.com; {dayou.cdy, wangjiuniu.wjn, yingya.zyy, jingren.zhou}@alibaba-inc.com; {shenyujun0302, zhaodeli}@gmail.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are publicly available at https://videocomposer.github.io.
Open Datasets | Yes | To optimize VideoComposer, we leverage two widely recognized and publicly accessible datasets: WebVid10M [2] and LAION-400M [51].
Dataset Splits | No | The paper mentions using WebVid10M and LAION-400M for training and MSR-VTT for text-to-video generation evaluation, but it does not specify explicit training, validation, and test splits for these datasets with percentages or sample counts.
Hardware Specification | No | The paper mentions 'GPUs' being used for training, but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU types, or cloud instance specifications.
Software Dependencies | No | The paper mentions using FlashAttention [12] and extending Stable Diffusion 2, but does not provide specific version numbers for these or other software libraries or frameworks required for replication.
Experiment Setup | Yes | We adopt AdamW [35] as the default optimizer with a learning rate set to 5 × 10^-5. In total, VideoComposer is pre-trained for 400k steps, with the first and second stage being pre-trained for 132k steps and 268k steps, respectively. We use center crop and randomly sample video frames to compose the video input whose F = 16, H = 256 and W = 256. During the second stage pre-training, we adhere to [28], using a probability of 0.1 to keep all conditions, a probability of 0.1 to discard all conditions, and an independent probability of 0.5 to keep or discard a specific condition.
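
Since the Experiment Setup row spells out concrete hyperparameters, a minimal PyTorch sketch of how they could be wired up is given below. This is not the authors' released code: the tiny stand-in module and the `conditions` dictionary are illustrative assumptions, and only AdamW, the 5 × 10^-5 learning rate, F = 16, H = 256, W = 256, and the keep/drop probabilities come from the quoted text.

```python
# Minimal sketch (assumed, not the authors' implementation) of the reported
# training setup: AdamW at lr 5e-5, 16-frame 256x256 clips, and the
# second-stage condition-dropping scheme.
import random
import torch
from torch import nn
from torch.optim import AdamW

F, H, W = 16, 256, 256                             # frames and spatial resolution per clip

model = nn.Conv3d(4, 4, kernel_size=3, padding=1)  # stand-in for the video diffusion UNet
optimizer = AdamW(model.parameters(), lr=5e-5)     # learning rate reported in the paper


def drop_conditions(conditions: dict) -> dict:
    """Second-stage condition masking: with p=0.1 keep all conditions,
    with p=0.1 discard all, otherwise keep/discard each one independently with p=0.5."""
    r = random.random()
    if r < 0.1:
        return dict(conditions)                               # keep every condition
    if r < 0.2:
        return {name: None for name in conditions}            # discard every condition
    return {name: (cond if random.random() < 0.5 else None)   # independent per-condition choice
            for name, cond in conditions.items()}


# Example usage with hypothetical condition tensors for one training sample.
conditions = {
    "text": torch.randn(1, 77, 1024),
    "sketch": torch.randn(1, F, 1, H, W),
    "motion": torch.randn(1, F, 2, H, W),
}
masked = drop_conditions(conditions)
```

The per-condition 0.5 keep/drop follows the classifier-free-guidance-style training referenced via [28]; the exact way discarded conditions are represented (zero tensors, learned null embeddings, etc.) is not specified in the quoted text and is left as `None` here.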