ControlVideo: Training-free Controllable Text-to-Video Generation
Authors: Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our ControlVideo outperforms the state-of-the-arts both quantitatively and qualitatively. It is worth noting that, thanks to the efficient designs, ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti. |
| Researcher Affiliation | Collaboration | ¹Harbin Institute of Technology, ²Huawei Cloud |
| Pseudocode | Yes | Algorithm 1 Interleaved-frame smoother (a hedged sketch follows the table) |
| Open Source Code | Yes | Code and videos are available at this link. |
| Open Datasets | Yes | To evaluate our ControlVideo, we collect 25 object-centric videos from the DAVIS dataset (Pont-Tuset et al., 2017) and manually annotate their source descriptions. |
| Dataset Splits | No | The paper mentions using 125 motion-prompt pairs as an evaluation dataset but does not specify explicit training, validation, or test dataset splits needed for reproduction. |
| Hardware Specification | Yes | ControlVideo could generate both short and long videos within several minutes using one NVIDIA 2080Ti. |
| Software Dependencies | No | The paper mentions ControlNet, RIFE, and xFormers but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The synthesized short videos are of length 15, while the long videos usually contain about 100 frames. Unless otherwise noted, their resolution is 512 × 512. During sampling, we adopt DDIM sampling (Song et al., 2020a) with 50 timesteps, and the interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default. |
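
For illustration, here is a minimal Python sketch of how the interleaved-frame smoother named in Algorithm 1 could be slotted into the 50-step DDIM loop described in the setup row. It is a sketch under assumptions, not the authors' implementation: the names `interpolate_frame` and `interleaved_frame_smoother` are hypothetical, and `interpolate_frame` merely averages neighbouring frames as a stand-in for a frame-interpolation model such as RIFE.

```python
# Sketch: interleaved-frame smoothing at DDIM timesteps {30, 31} of a 50-step
# schedule, on 15 predicted RGB frames at 512x512, mirroring the reported setup.
import numpy as np

def interpolate_frame(prev_frame, next_frame):
    """Placeholder for RIFE-style frame interpolation (assumed, not the paper's code)."""
    return 0.5 * (prev_frame + next_frame)

def interleaved_frame_smoother(frames, step, smooth_steps=(30, 31)):
    """Re-synthesize every other middle frame from its two neighbours.

    The parity of the re-synthesized frames alternates between the two
    smoothing timesteps, so across both steps every middle frame is
    smoothed once.
    """
    if step not in smooth_steps:
        return frames
    start = 1 + smooth_steps.index(step)        # smooth odd frames first, then even
    smoothed = frames.copy()
    for i in range(start, len(frames) - 1, 2):
        smoothed[i] = interpolate_frame(frames[i - 1], frames[i + 1])
    return smoothed

if __name__ == "__main__":
    frames = np.random.rand(15, 512, 512, 3).astype(np.float32)  # dummy predicted frames
    for step in range(50):                      # 50 DDIM timesteps
        frames = interleaved_frame_smoother(frames, step)
```

In this reading, the smoother only activates at two consecutive timesteps so that the extra frame-interpolation cost stays small relative to the full 50-step sampling loop.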