ControlVideo: Training-free Controllable Text-to-Video Generation

Authors: Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our ControlVideo outperforms the state of the art both quantitatively and qualitatively. It is worth noting that, thanks to the efficient designs, ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti.
Researcher Affiliation | Collaboration | Harbin Institute of Technology; Huawei Cloud
Pseudocode | Yes | Algorithm 1: Interleaved-frame smoother (a hedged sketch follows the table)
Open Source Code | Yes | Code and videos are available at this link.
Open Datasets | Yes | To evaluate our ControlVideo, we collect 25 object-centric videos from the DAVIS dataset (Pont-Tuset et al., 2017) and manually annotate their source descriptions.
Dataset Splits | No | The paper mentions using 125 motion-prompt pairs as an evaluation dataset but does not specify the explicit training, validation, or test splits needed for reproduction.
Hardware Specification | Yes | ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti.
Software Dependencies | No | The paper mentions ControlNet, RIFE, and xFormers but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The synthesized short videos are of length 15, while the long videos usually contain about 100 frames. Unless otherwise noted, their resolution is 512 × 512 in both cases. During sampling, we adopt DDIM sampling (Song et al., 2020a) with 50 timesteps, and the interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default. (A sampling-configuration sketch follows the table.)
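To make the Pseudocode row concrete, below is a minimal sketch of an interleaved-frame smoother, assuming that smoothing alternates between even- and odd-indexed middle frames on two consecutive denoising steps and that a frame-interpolation function (such as a RIFE wrapper) is available. The names interleaved_frame_smoother and interpolate, and the simple averaging stand-in, are illustrative assumptions rather than the authors' released implementation; see Algorithm 1 in the paper for the exact procedure.

```python
import torch


def interleaved_frame_smoother(frames: torch.Tensor, step: int, interpolate) -> torch.Tensor:
    """Replace every other middle frame with an interpolation of its neighbours.

    frames      : predicted RGB frames, shape (T, C, H, W)
    step        : current denoising step; its parity decides which frames are smoothed,
                  so two consecutive steps together smooth every middle frame once
    interpolate : callable (prev_frame, next_frame) -> middle_frame, e.g. a RIFE wrapper
    """
    smoothed = frames.clone()
    start = 1 if step % 2 == 0 else 2
    for i in range(start, frames.shape[0] - 1, 2):
        smoothed[i] = interpolate(frames[i - 1], frames[i + 1])
    return smoothed


# Toy usage: simple averaging stands in for a learned interpolator such as RIFE.
frames = torch.rand(15, 3, 512, 512)      # 15 RGB frames at 512 x 512
average = lambda a, b: 0.5 * (a + b)
for step in (30, 31):                     # the two DDIM steps reported in the paper
    frames = interleaved_frame_smoother(frames, step, average)
```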
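For the Experiment Setup row, the snippet below sketches how the reported sampling configuration (50 DDIM timesteps, 15 frames at 512 × 512, smoother applied at steps 30 and 31) could be wired up with the diffusers DDIMScheduler. The scheduler hyperparameters, variable names, and loop skeleton are assumptions for illustration, not the released code.

```python
from diffusers import DDIMScheduler

# Assumed Stable Diffusion-style noise schedule; only the 50-step DDIM setting,
# the frame count/resolution, and the smoother steps come from the paper.
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
)
scheduler.set_timesteps(50)                   # 50 DDIM sampling timesteps
num_frames, height, width = 15, 512, 512      # short-video setting
smoother_steps = {30, 31}                     # steps at which the smoother runs

for i, t in enumerate(scheduler.timesteps):
    # ... denoise the latents of all frames with the controllable UNet (omitted) ...
    if i in smoother_steps:
        # Decode the predicted clean latents to RGB frames, apply the
        # interleaved-frame smoother (see the sketch above), then re-encode.
        pass
```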