Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
Authors: Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. |
| Researcher Affiliation | Collaboration | Penghui Ruan¹·², Pichao Wang³, Divya Saxena¹, Jiannong Cao¹, Yuhui Shi²; ¹The Hong Kong Polytechnic University, Hong Kong, China; ²Southern University of Science and Technology, Shenzhen, China; ³Amazon, Seattle, United States |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://PR-Ryan.github.io/DEMO-project/ [Abstract]... We provide code for our proposed method in supplementary materials. |
| Open Datasets | Yes | Specifically, we use WebVid-10M [1], a large-scale dataset of short videos with textual descriptions, as our fine-tuning dataset. |
| Dataset Splits | Yes | For WebVid-10M [1], we perform T2V generation on the validation set. As shown in Table 3, we evaluate the FID, FVD, and CLIPSIM, where we randomly sample 5K text-video pairs from the validation set. |
| Hardware Specification | Yes | DEMO is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 24 per GPU. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, the DeepSpeed framework, VQGAN, the DDIM sampler, RAFT, and CLIP ViT-H/14, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | As shown in Table 7, we train DEMO using the Adam optimizer [25] with a One Cycle scheduler [47]. Specifically, the learning rate varies within the range of [0.00001, 0.00005], while the momentum oscillates between 0.85 and 0.99. ...DEMO is trained with 1000 diffusion steps. We set the classifier-free guidance scale to 9, randomly dropping the text with probability 0.1 during training. For inference, we use the DDIM sampler [49] with 50 inference steps. |
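The reported hyperparameters can be sketched in a few lines of Python. This is a hypothetical re-creation of the setup described above (a cosine one-cycle learning-rate/momentum schedule over the stated ranges, prompt dropout for classifier-free guidance training, and the guidance combination used at sampling time), not the authors' actual code; the `warmup_frac` parameter and the cosine shape are assumptions, since the paper only gives the ranges.

```python
import math
import random

def one_cycle(step, total_steps, lr_range=(1e-5, 5e-5),
              mom_range=(0.85, 0.99), warmup_frac=0.3):
    """One-cycle schedule: lr rises from its minimum to its maximum and
    anneals back, while momentum moves in the opposite direction.
    warmup_frac is an assumed value, not from the paper."""
    lr_lo, lr_hi = lr_range
    mom_lo, mom_hi = mom_range
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        t = step / max(warmup, 1)                               # 0 -> 1
    else:
        t = 1 - (step - warmup) / max(total_steps - warmup, 1)  # 1 -> 0
    mix = (1 - math.cos(math.pi * t)) / 2   # smooth cosine interpolation
    lr = lr_lo + (lr_hi - lr_lo) * mix
    momentum = mom_hi - (mom_hi - mom_lo) * mix  # high momentum at low lr
    return lr, momentum

def drop_text(prompt, p=0.1):
    """Classifier-free guidance training: with probability p, replace the
    prompt with the empty string so the model also learns the
    unconditional score."""
    return "" if random.random() < p else prompt

def cfg_combine(eps_uncond, eps_cond, scale=9.0):
    """Inference-time guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

For example, `one_cycle(0, 1000)` returns the minimum learning rate (1e-5) paired with the maximum momentum (0.99), and `cfg_combine` with `scale=9.0` reproduces the standard classifier-free guidance formula at the paper's reported scale.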