Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Authors: Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality.
Researcher Affiliation | Collaboration | Penghui Ruan (1,2), Pichao Wang (3), Divya Saxena (1), Jiannong Cao (1), Yuhui Shi (2); (1) The Hong Kong Polytechnic University, Hong Kong, China; (2) Southern University of Science and Technology, Shenzhen, China; (3) Amazon, Seattle, United States
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://PR-Ryan.github.io/DEMO-project/ [Abstract] ... "We provide code for our proposed method in supplementary materials."
Open Datasets | Yes | Specifically, we use WebVid-10M [1], a large-scale dataset of short videos with textual descriptions, as our fine-tuning dataset.
Dataset Splits | Yes | For WebVid-10M [1], we perform T2V generation on the validation set. As shown in Table 3, we evaluate FID, FVD, and CLIPSIM, where we randomly sample 5K text-video pairs from the validation set.
Hardware Specification | Yes | DEMO is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 24 per GPU.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, the DeepSpeed framework, VQGAN, the DDIM sampler, RAFT, and CLIP ViT-H/14, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | As shown in Table 7, we train DEMO using the Adam optimizer [25] with a One-Cycle scheduler [47]. Specifically, the learning rate varies within the range [0.00001, 0.00005], while the momentum oscillates between 0.85 and 0.99. ... DEMO is trained with 1000 diffusion steps. We set the classifier-free guidance scale to 9, with a 0.1 probability of randomly dropping the text during training. For inference, we use the DDIM sampler [49] with 50 inference steps.
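
The settings in the Experiment Setup row are concrete enough to sketch in code. Below is a minimal PyTorch sketch of the reported training configuration, not the authors' implementation: the linear model, total step count, and synthetic batches are placeholder assumptions, while the Adam optimizer, the One-Cycle learning-rate range [1e-5, 5e-5], the momentum range [0.85, 0.99], the per-GPU batch size of 24, and the 0.1 text-drop probability come from the table above.

```python
import random
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR

torch.manual_seed(0)

# Stand-in for the T2V diffusion backbone; the real DEMO model is not shown here.
model = nn.Linear(64, 64)

TOTAL_STEPS = 1_000  # total training steps are not stated in the excerpt; placeholder

optimizer = Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))

# OneCycleLR ramps the LR from max_lr/div_factor up to max_lr and back down,
# and with cycle_momentum=True cycles Adam's beta1 between base_ and max_momentum.
scheduler = OneCycleLR(
    optimizer,
    max_lr=5e-5,           # top of the stated [0.00001, 0.00005] range
    div_factor=5.0,        # initial lr = 5e-5 / 5 = 1e-5 (bottom of the range)
    final_div_factor=1.0,  # keep the final LR at the bottom of the stated range
    total_steps=TOTAL_STEPS,
    cycle_momentum=True,
    base_momentum=0.85,    # stated momentum range: 0.85 to 0.99
    max_momentum=0.99,
)

P_TEXT_DROP = 0.1  # probability of dropping the caption, for classifier-free guidance

for step in range(TOTAL_STEPS):
    x = torch.randn(24, 64)  # batch size 24 per GPU, per the Hardware row
    # With prob 0.1, replace the caption embedding by a null embedding
    # (a zero vector here; the paper's exact null conditioning is an assumption).
    drop_text = random.random() < P_TEXT_DROP
    cond = torch.zeros(24, 64) if drop_text else torch.randn(24, 64)
    loss = (model(x + cond) - x).pow(2).mean()  # toy denoising-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

In the paper's setup this loop would run under DeepSpeed across the 4 A100s; the sketch shows only the single-process schedule.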
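For the inference side (DDIM with 50 steps, classifier-free guidance scale 9), here is a minimal sampling-loop sketch. It assumes Hugging Face diffusers' DDIMScheduler and a toy noise predictor in place of the paper's model; the latent shape and embedding shapes are likewise assumptions.

```python
import torch
from diffusers import DDIMScheduler  # assumes the `diffusers` package is installed

scheduler = DDIMScheduler(num_train_timesteps=1000)  # 1000 diffusion steps, per the paper
scheduler.set_timesteps(50)                          # 50 DDIM inference steps

GUIDANCE_SCALE = 9.0  # classifier-free guidance scale, per the paper

# Toy epsilon-predictor standing in for the T2V backbone; it ignores conditioning.
def predict_noise(latents, t, cond):
    return torch.randn_like(latents)

# (batch, channels, frames, height, width): the video-latent shape is an assumption.
latents = torch.randn(1, 4, 16, 32, 32)
text_emb = torch.randn(1, 77, 768)      # placeholder caption embedding
null_emb = torch.zeros_like(text_emb)   # placeholder null (dropped-text) embedding

for t in scheduler.timesteps:
    eps_uncond = predict_noise(latents, t, null_emb)
    eps_text = predict_noise(latents, t, text_emb)
    # Classifier-free guidance: push the prediction away from the unconditional one.
    eps = eps_uncond + GUIDANCE_SCALE * (eps_text - eps_uncond)
    latents = scheduler.step(eps, t, latents).prev_sample
```

The authors' actual sampler lives in the code released via the project page and supplementary materials; this sketch only illustrates how the reported settings fit together.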