Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
Authors: Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. |
| Researcher Affiliation | Collaboration | Penghui Ruan (1,2), Pichao Wang (3), Divya Saxena (1), Jiannong Cao (1), Yuhui Shi (2). (1) The Hong Kong Polytechnic University, Hong Kong, China; (2) Southern University of Science and Technology, Shenzhen, China; (3) Amazon, Seattle, United States |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://PR-Ryan.github.io/DEMO-project/ [Abstract]... We provide code for our proposed method in supplementary materials. |
| Open Datasets | Yes | Specifically, we use WebVid-10M [1], a large-scale dataset of short videos with textual descriptions as our fine-tuning dataset. |
| Dataset Splits | Yes | For WebVid-10M [1], we perform T2V generation on the validation set. As shown in Table 3, we evaluate the FID, FVD, and CLIPSIM, where we randomly sample 5K text-video pairs from the validation set. |
| Hardware Specification | Yes | DEMO is trained on 4 NVIDIA Tesla A100 GPUs with a batch size of 24 per GPU. |
| Software Dependencies | No | The paper mentions software components such as 'Adam optimizer', 'DeepSpeed framework', 'VQGAN', 'DDIM sampler', 'RAFT' and 'CLIP ViT-H/14' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | As shown in Table 7, we train DEMO using the Adam optimizer [25] with a OneCycle scheduler [47]. Specifically, the learning rate varies within the range of [0.00001, 0.00005], while the momentum oscillates between 0.85 and 0.99. ... DEMO is trained with 1000 diffusion steps. We set the classifier-free guidance scale as 9 with the probability of 0.1 randomly dropping the text during training. For inference, we use the DDIM sampler [49] with 50 inference steps. |
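
The hardware and experiment-setup rows above can be gathered into a small configuration sketch. The following is a minimal, standard-library-only illustration and is not the authors' code: the class and function names are hypothetical, the cosine shape and `warmup_frac` of the one-cycle schedule are assumptions (the table gives only the learning-rate and momentum ranges), the global batch size assumes plain data parallelism without gradient accumulation, and the evenly strided timestep subset is one common DDIM convention that the table does not confirm.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoSetup:
    """Values quoted in the table above; the field names are illustrative."""
    num_gpus: int = 4
    batch_size_per_gpu: int = 24
    lr_min: float = 1e-5
    lr_max: float = 5e-5
    momentum_min: float = 0.85
    momentum_max: float = 0.99
    train_diffusion_steps: int = 1000
    inference_steps: int = 50
    guidance_scale: float = 9.0
    text_drop_prob: float = 0.1


def global_batch_size(cfg: DemoSetup) -> int:
    # Assumption: plain data parallelism, no gradient accumulation.
    return cfg.num_gpus * cfg.batch_size_per_gpu


def _cos_interp(start: float, end: float, t: float) -> float:
    # Cosine interpolation: returns `start` at t=0 and `end` at t=1.
    return end + (start - end) * (1.0 + math.cos(math.pi * t)) / 2.0


def one_cycle(step: int, total_steps: int, cfg: DemoSetup, warmup_frac: float = 0.3):
    """Return (lr, momentum) at `step`; curve shape and warmup_frac are assumed."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if warmup_steps >= total_steps - 1:
        raise ValueError("total_steps is too small for the chosen warmup_frac")
    if step <= warmup_steps:
        # Warmup: learning rate rises while momentum falls.
        t = step / warmup_steps
        return (_cos_interp(cfg.lr_min, cfg.lr_max, t),
                _cos_interp(cfg.momentum_max, cfg.momentum_min, t))
    # Annealing: learning rate falls back while momentum recovers.
    t = (step - warmup_steps) / (total_steps - 1 - warmup_steps)
    return (_cos_interp(cfg.lr_max, cfg.lr_min, t),
            _cos_interp(cfg.momentum_min, cfg.momentum_max, t))


def maybe_drop_text(prompt: str, drop_prob: float, rng: random.Random) -> str:
    # Replace the prompt with an empty string with probability `drop_prob`,
    # so the model also learns an unconditional prediction.
    return "" if rng.random() < drop_prob else prompt


def guided_noise(eps_uncond: float, eps_cond: float, scale: float) -> float:
    # Standard classifier-free guidance combination, shown on scalars.
    return eps_uncond + scale * (eps_cond - eps_uncond)


def ddim_timesteps(cfg: DemoSetup) -> list:
    # Evenly strided subset of the training timesteps, highest noise first.
    stride = cfg.train_diffusion_steps // cfg.inference_steps
    return list(range(0, cfg.train_diffusion_steps, stride))[::-1]
```

Under these assumptions, `global_batch_size(DemoSetup())` evaluates to 4 × 24 = 96, and the 50 inference steps visit every 20th of the 1000 training timesteps.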