MotionBooth: Motion-Aware Customized Text-to-Video Generation
Authors: Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. |
| Researcher Affiliation | Collaboration | Jianzong Wu (1, 3), Xiangtai Li (2, 3), Yanhong Zeng (3), Jiangning Zhang (4), Qianyu Zhou (5), Yining Li (3), Kai Chen (3), Yunhai Tong (1). Affiliations: 1 PKU; 2 S-Lab, NTU; 3 Shanghai AI Laboratory; 4 ZJU; 5 SJTU |
| Pseudocode | Yes | Pseudo-code of latent shift. To present the latent shift module more clearly, we show the pseudo-code of the algorithm in Fig. 14. |
| Open Source Code | No | We are not able to provide the code at submission time, but we will make sure that our code and models are released publicly in the future. |
| Open Datasets | Yes | For customization, we collect a total of 26 objects from DreamBooth [42] and Custom Diffusion [30]. |
| Dataset Splits | No | The paper describes the datasets used for customization and evaluation, but does not explicitly provide training, validation, and test dataset splits with percentages or sample counts for these datasets. |
| Hardware Specification | Yes | The training process finishes in around 10 minutes on a single NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer' and the 'DDIM scheduler' but does not specify their version numbers or the versions of other key software dependencies such as PyTorch. |
| Experiment Setup | Yes | We train MotionBooth for 300 steps using the AdamW optimizer, with a learning rate of 5e-2 and a weight decay of 1e-2... The loss weight parameters λ1 and λ2 are set to 1.0 and 0.01. We use Zeroscope and LaVie as base models. During inference, we perform 50-step denoising using the DDIM scheduler and set the classifier-free guidance scale to 7.5. The generated videos are 576x320x24 and 512x320x16 for Zeroscope and LaVie, respectively. |
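
The experiment-setup row can be collected into a small configuration sketch. The numeric values come from the row above. Everything else is an assumption and is not the authors' code: the class and function names are hypothetical, the `(width, height, frames)` reading of the video sizes is assumed, and so is the 1000-step training schedule used to space the DDIM timesteps. The table does not say which loss terms λ1 and λ2 weight, so `total_loss` is only a generic weighted sum. The guidance function uses the standard classifier-free guidance combination, which is not specific to MotionBooth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionBoothSetup:
    # Values reported in the experiment-setup row; field names are hypothetical.
    train_steps: int = 300
    learning_rate: float = 5e-2
    weight_decay: float = 1e-2
    lambda1: float = 1.0
    lambda2: float = 0.01
    ddim_steps: int = 50
    guidance_scale: float = 7.5


# Reported output sizes, read as (width, height, frames); the axis order is an assumption.
VIDEO_SHAPES = {"Zeroscope": (576, 320, 24), "LaVie": (512, 320, 16)}


def total_loss(l_main, l_1, l_2, cfg):
    # Hypothetical weighted sum; the terms weighted by lambda1/lambda2 are not named here.
    return l_main + cfg.lambda1 * l_1 + cfg.lambda2 * l_2


def ddim_timesteps(num_inference_steps, num_train_timesteps=1000):
    # Evenly spaced DDIM timesteps, ordered from noisiest to cleanest.
    # num_train_timesteps=1000 is an assumption about the base models.
    step = num_train_timesteps // num_inference_steps
    return [i * step for i in reversed(range(num_inference_steps))]


def cfg_noise(eps_uncond, eps_cond, scale):
    # Classifier-free guidance: move the prediction away from the unconditional one.
    # Works with floats, NumPy arrays, or tensors because it only uses + - *.
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    cfg = MotionBoothSetup()
    ts = ddim_timesteps(cfg.ddim_steps)
    print(len(ts), ts[:3], ts[-1])
    print(cfg_noise(1.0, 2.0, cfg.guidance_scale))
```

In PyTorch, the reported optimizer settings would correspond to `torch.optim.AdamW(params, lr=cfg.learning_rate, weight_decay=cfg.weight_decay)`. A guidance scale of 7.5 extrapolates well beyond the conditional prediction. For example, with an unconditional value of 1.0 and a conditional value of 2.0, the guided value is 8.5.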