ControlVideo: Training-free Controllable Text-to-Video Generation

Authors: Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our ControlVideo outperforms the state of the art both quantitatively and qualitatively. It is worth noting that, thanks to the efficient designs, ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti.
Researcher Affiliation | Collaboration | Harbin Institute of Technology; Huawei Cloud
Pseudocode | Yes | Algorithm 1: Interleaved-frame smoother (a hedged sketch follows the table)
Open Source Code | Yes | Code and videos are available at this link.
Open Datasets | Yes | To evaluate our ControlVideo, we collect 25 object-centric videos from the DAVIS dataset (Pont-Tuset et al., 2017) and manually annotate their source descriptions.
Dataset Splits | No | The paper mentions using 125 motion-prompt pairs as an evaluation dataset but does not specify the explicit training, validation, or test splits needed for reproduction.
Hardware Specification | Yes | ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti.
Software Dependencies | No | The paper mentions ControlNet, RIFE, and xFormers but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The synthesized short videos are of length 15, while the long videos usually contain about 100 frames. Unless otherwise noted, their resolution is 512 × 512 in both cases. During sampling, we adopt DDIM sampling (Song et al., 2020a) with 50 timesteps, and the interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default. (A sampling-configuration sketch follows the table.)
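To make the Pseudocode row concrete, below is a minimal sketch of an interleaved-frame smoother, assuming that smoothing alternates between even- and odd-indexed middle frames on two consecutive denoising steps and that a frame-interpolation function (such as a RIFE wrapper) is available. The names interleaved_frame_smoother and interpolate, and the simple averaging stand-in, are illustrative assumptions rather than the authors' released implementation; see Algorithm 1 in the paper for the exact procedure.

```python
import torch


def interleaved_frame_smoother(frames: torch.Tensor, step: int, interpolate) -> torch.Tensor:
    """Replace every other middle frame with an interpolation of its neighbours.

    frames      : predicted RGB frames, shape (T, C, H, W)
    step        : current denoising step; its parity decides which frames are smoothed,
                  so two consecutive steps together smooth every middle frame once
    interpolate : callable (prev_frame, next_frame) -> middle_frame, e.g. a RIFE wrapper
    """
    smoothed = frames.clone()
    start = 1 if step % 2 == 0 else 2
    for i in range(start, frames.shape[0] - 1, 2):
        smoothed[i] = interpolate(frames[i - 1], frames[i + 1])
    return smoothed


# Toy usage: simple averaging stands in for a learned interpolator such as RIFE.
frames = torch.rand(15, 3, 512, 512)      # 15 RGB frames at 512 x 512
average = lambda a, b: 0.5 * (a + b)
for step in (30, 31):                     # the two DDIM steps reported in the paper
    frames = interleaved_frame_smoother(frames, step, average)
```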
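For the Experiment Setup row, the snippet below sketches how the reported sampling configuration (50 DDIM timesteps, 15 frames at 512 × 512, smoother applied at steps 30 and 31) could be wired up with the diffusers DDIMScheduler. The scheduler hyperparameters, variable names, and loop skeleton are assumptions for illustration, not the released code.

```python
from diffusers import DDIMScheduler

# Assumed Stable Diffusion-style noise schedule; only the 50-step DDIM setting,
# the frame count/resolution, and the smoother steps come from the paper.
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
)
scheduler.set_timesteps(50)                   # 50 DDIM sampling timesteps
num_frames, height, width = 15, 512, 512      # short-video setting
smoother_steps = {30, 31}                     # steps at which the smoother runs

for i, t in enumerate(scheduler.timesteps):
    # ... denoise the latents of all frames with the controllable UNet (omitted) ...
    if i in smoother_steps:
        # Decode the predicted clean latents to RGB frames, apply the
        # interleaved-frame smoother (see the sketch above), then re-encode.
        pass
```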