MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Authors: Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MoMu-Diffusion surpasses recent state-of-the-art methods both qualitatively and quantitatively, and can synthesize realistic, diverse, long-term, and beat-matched music or motion sequences. We have conducted extensive experiments on three motion-to-music and two music-to-motion datasets, including scenarios such as dancing and competitive sports. |
| Researcher Affiliation | Academia | Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao Zhejiang University fumyou13@gmail.com |
| Pseudocode | Yes | We provide the pseudo-codes of cross-modal generation and multi-modal joint generation in Algorithm 1 and 2, respectively. |
| Open Source Code | Yes | The generated samples and codes are available at https://momu-diffusion.github.io/. |
| Open Datasets | Yes | We evaluate our method on the latest LORIS benchmark [49], which contains 86.43 hours of video samples synchronized with music. This benchmark presents three demanding scenarios: AIST++ Dance [30], Floor Exercise [42], and Figure Skating [47, 46]. ... We use two datasets: AIST++ Dance [30] and BHS Dance [26]. |
| Dataset Splits | Yes | In our experiments, each dataset is randomly split with a 90%/5%/5% proportion for training, validation, and testing. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | We use 8 NVIDIA 4090 GPUs and it takes about 12 hours to finish. It takes about 2 days on 8 NVIDIA 4090 GPUs. |
| Software Dependencies | No | Here is the Python code based on the Librosa library: librosa.onset.onset_detect(y=audio, sr=sampling_rate, wait=1, delta=0.2, pre_avg=3, post_avg=3, pre_max=3, post_max=3, units='time'). OpenPose [3] is applied to extract 2D body keypoints. The paper mentions software like Librosa and OpenPose, but does not provide version numbers for these components or for the other libraries used in the experiments. (A runnable version of the Librosa snippet follows the table.) |
| Experiment Setup | Yes | The detailed hyper-parameters of BiCoR-VAE are listed in Table 8. The hyper-parameters of our FFT model are listed in Table 9. For training BiCoR-VAE, we use the AdamW optimizer with a learning rate of 2e-4 and 300 training epochs. The FFT diffusion model is trained with the AdamW optimizer [23], a learning rate of 1.6e-5, and a lambda linear scheduler with a warmup of 10,000 steps. We train the diffusion model for 200 epochs on each task. (A hedged optimizer/scheduler sketch follows the table.) |
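For reference, a minimal sketch of the 90%/5%/5% random split reported above. The function name, seed, and index-based interface are illustrative assumptions; the paper does not describe its splitting code beyond the proportions.

```python
import random

def split_dataset(indices, seed=0):
    """Randomly split sample indices 90%/5%/5% into train/val/test,
    mirroring the proportions reported in the paper.
    The name, seed, and interface are illustrative, not from the released code."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.90 * n)
    n_val = int(0.05 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remaining ~5%
    return train, val, test

train_ids, val_ids, test_ids = split_dataset(range(1000))
```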
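The Librosa call quoted in the Software Dependencies row can be made runnable as follows. The peak-picking parameters are the ones reported in the paper; the audio file path is a placeholder, and since no Librosa version is given, the default 22.05 kHz resampling of `librosa.load` is assumed.

```python
import librosa

# Load an audio file (path is a placeholder);
# librosa resamples to 22.05 kHz by default.
audio, sampling_rate = librosa.load("sample_music.wav")

# Onset detection with the peak-picking parameters quoted in the paper.
onsets = librosa.onset.onset_detect(
    y=audio,
    sr=sampling_rate,
    wait=1,                  # minimum frames between consecutive onsets
    delta=0.2,               # threshold offset for peak picking
    pre_avg=3, post_avg=3,   # window (frames) for the moving-average threshold
    pre_max=3, post_max=3,   # window (frames) for the local-maximum test
    units="time",            # return onset times in seconds, not frame indices
)
print(onsets)
```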
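Finally, a hedged PyTorch sketch of the reported diffusion-model optimizer configuration (AdamW, learning rate 1.6e-5, lambda linear scheduler with 10,000 warmup steps). The "lambda linear scheduler" is interpreted here as a linear warmup implemented with `LambdaLR`; the stand-in model, the exact warmup shape, and the loop are assumptions, not the authors' released code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)  # stand-in for the FFT diffusion model

# Hyper-parameters as reported: AdamW with lr 1.6e-5, 10,000 warmup steps.
optimizer = AdamW(model.parameters(), lr=1.6e-5)
warmup_steps = 10_000

def lambda_linear(step):
    # Linear warmup from 0 to 1 over `warmup_steps`, then constant;
    # one plausible reading of the paper's "lambda linear scheduler".
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=lambda_linear)

for step in range(100):  # training-loop placeholder
    optimizer.step()     # gradients would be computed by a real loss/backward
    scheduler.step()
```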