Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Authors: Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality.
Researcher Affiliation | Collaboration | Penghui Ruan (1,2), Pichao Wang (3), Divya Saxena (1), Jiannong Cao (1), Yuhui Shi (2); (1) The Hong Kong Polytechnic University, Hong Kong, China; (2) Southern University of Science and Technology, Shenzhen, China; (3) Amazon, Seattle, United States
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://PR-Ryan.github.io/DEMO-project/ [Abstract] ... "We provide code for our proposed method in supplementary materials."
Open Datasets | Yes | Specifically, we use WebVid-10M [1], a large-scale dataset of short videos with textual descriptions, as our fine-tuning dataset.
Dataset Splits | Yes | For WebVid-10M [1], we perform T2V generation on the validation set. As shown in Table 3, we evaluate FID, FVD, and CLIPSIM, where we randomly sample 5K text-video pairs from the validation set.
Hardware Specification | Yes | DEMO is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 24 per GPU.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, the DeepSpeed framework, VQGAN, the DDIM sampler, RAFT, and CLIP ViT-H/14, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | As shown in Table 7, we train DEMO using the Adam optimizer [25] with a One-Cycle scheduler [47]. Specifically, the learning rate varies within the range [0.00001, 0.00005], while the momentum oscillates between 0.85 and 0.99. ... DEMO is trained with 1000 diffusion steps. We set the classifier-free guidance scale to 9, with a 0.1 probability of randomly dropping the text during training. For inference, we use the DDIM sampler [49] with 50 inference steps.
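
The settings in the Experiment Setup row are concrete enough to sketch in code. Below is a minimal PyTorch sketch of the reported training configuration, not the authors' implementation: the linear model, total step count, and synthetic batches are placeholder assumptions, while the Adam optimizer, the One-Cycle learning-rate range [1e-5, 5e-5], the momentum range [0.85, 0.99], the per-GPU batch size of 24, and the 0.1 text-drop probability come from the table above.

```python
import random
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR

torch.manual_seed(0)

# Stand-in for the T2V diffusion backbone; the real DEMO model is not shown here.
model = nn.Linear(64, 64)

TOTAL_STEPS = 1_000  # total training steps are not stated in the excerpt; placeholder

optimizer = Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))

# OneCycleLR ramps the LR from max_lr/div_factor up to max_lr and back down,
# and with cycle_momentum=True cycles Adam's beta1 between base_ and max_momentum.
scheduler = OneCycleLR(
    optimizer,
    max_lr=5e-5,           # top of the stated [0.00001, 0.00005] range
    div_factor=5.0,        # initial lr = 5e-5 / 5 = 1e-5 (bottom of the range)
    final_div_factor=1.0,  # keep the final LR at the bottom of the stated range
    total_steps=TOTAL_STEPS,
    cycle_momentum=True,
    base_momentum=0.85,    # stated momentum range: 0.85 to 0.99
    max_momentum=0.99,
)

P_TEXT_DROP = 0.1  # probability of dropping the caption, for classifier-free guidance

for step in range(TOTAL_STEPS):
    x = torch.randn(24, 64)  # batch size 24 per GPU, per the Hardware row
    # With prob 0.1, replace the caption embedding by a null embedding
    # (a zero vector here; the paper's exact null conditioning is an assumption).
    drop_text = random.random() < P_TEXT_DROP
    cond = torch.zeros(24, 64) if drop_text else torch.randn(24, 64)
    loss = (model(x + cond) - x).pow(2).mean()  # toy denoising-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

In the paper's setup this loop would run under DeepSpeed across the 4 A100s; the sketch shows only the single-process schedule.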
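For the inference side (DDIM with 50 steps, classifier-free guidance scale 9), here is a minimal sampling-loop sketch. It assumes Hugging Face diffusers' DDIMScheduler and a toy noise predictor in place of the paper's model; the latent shape and embedding shapes are likewise assumptions.

```python
import torch
from diffusers import DDIMScheduler  # assumes the `diffusers` package is installed

scheduler = DDIMScheduler(num_train_timesteps=1000)  # 1000 diffusion steps, per the paper
scheduler.set_timesteps(50)                          # 50 DDIM inference steps

GUIDANCE_SCALE = 9.0  # classifier-free guidance scale, per the paper

# Toy epsilon-predictor standing in for the T2V backbone; it ignores conditioning.
def predict_noise(latents, t, cond):
    return torch.randn_like(latents)

# (batch, channels, frames, height, width): the video-latent shape is an assumption.
latents = torch.randn(1, 4, 16, 32, 32)
text_emb = torch.randn(1, 77, 768)      # placeholder caption embedding
null_emb = torch.zeros_like(text_emb)   # placeholder null (dropped-text) embedding

for t in scheduler.timesteps:
    eps_uncond = predict_noise(latents, t, null_emb)
    eps_text = predict_noise(latents, t, text_emb)
    # Classifier-free guidance: push the prediction away from the unconditional one.
    eps = eps_uncond + GUIDANCE_SCALE * (eps_text - eps_uncond)
    latents = scheduler.step(eps, t, latents).prev_sample
```

The authors' actual sampler lives in the code released via the project page and supplementary materials; this sketch only illustrates how the reported settings fit together.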