Human Motion Diffusion Model
Authors: Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-Or, Amit Haim Bermano
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. Overall, we introduce Motion Diffusion Model, a motion framework that achieves state-of-the-art quality in several motion generation tasks. We evaluate our model using two leading benchmarks KIT (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a), over the set of metrics suggested by Guo et al. (2022a): R-precision and Multimodal-Dist measure the relevancy of the generated motions to the input prompts, FID measures the dissimilarity between the generated and ground truth distributions (in latent space), Diversity measures the variability in the resulting motion distribution, and MultiModality is the average variance given a single text prompt. We conduct a thorough experiment to evaluate the contribution of geometric losses with the HumanAct12 dataset. (See the metric sketch below the table.) |
| Researcher Affiliation | Academia | Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or and Amit H. Bermano, Tel Aviv University, Israel. guytevet@mail.tau.ac.il |
| Pseudocode | No | No structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' are present in the paper. Figure 2 shows a diagram of the model and sampling process, but it is not pseudocode. |
| Open Source Code | Yes | Code can be found at https://github.com/GuyTevet/motion-diffusion-model. The full implementation of MDM can be found in our published code: https://github.com/GuyTevet/motion-diffusion-model |
| Open Datasets | Yes | We evaluate our model using two leading benchmarks KIT (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a). HumanML3D is a recent dataset, textually re-annotating motion capture from the AMASS (Mahmood et al., 2019) and HumanAct12 (Guo et al., 2020) collections. Two datasets are commonly used to evaluate action-to-motion models: HumanAct12 (Guo et al., 2020) and UESTC (Ji et al., 2018). |
| Dataset Splits | No | The paper mentions test sets for evaluation (e.g., 'HumanML3D test set', 'KIT test set') and cross-subject testing for UESTC, but does not provide specific details for training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit references to predefined splits for all sets used). |
| Hardware Specification | Yes | All of them have been trained on a single NVIDIA GeForce RTX 2080 Ti GPU for a period of about 3 days. |
| Software Dependencies | No | The paper mentions using PyTorch for implementation and specific models like CLIP ViT-B/32 and sentence-BERT, and references a DDPM implementation by Dhariwal & Nichol, but it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | All of them have been trained on a single NVIDIA GeForce RTX 2080 Ti GPU for a period of about 3 days. Our models have been trained with T = 1000 noising steps and a cosine noise schedule. Our models were trained with batch size 64, 8 layers (except GRU that was optimal at 2), and latent dimension 512. We evaluate our models with guidance-scale s = 2.5. The experiments have been run with batch size 64, a latent dimension of 512, and an encoder-transformer architecture. Training on HumanAct12 and UESTC has been carried out for 750K and 2M steps respectively. We used 8 transformer layers, 4 attention heads, latent dimension d = 512, dropout 0.1, feed-forward size 1024 and gelu activations. For all of our experiments, we use batch size 64, learning rate 10^-4. (See the configuration sketch below the table.) |
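
The Research Type row quotes the Guo et al. (2022a) evaluation metrics. Below is a minimal sketch, assuming motion features have already been extracted with the benchmark's pretrained evaluator, of how the Diversity and MultiModality numbers are typically computed; the function names and the random-pairing details are illustrative assumptions, not code from the MDM repository.

```python
# Hedged sketch of the Diversity and MultiModality metrics from Guo et al. (2022a).
# Assumes `features` were produced by the benchmark's pretrained motion evaluator;
# names and pairing details are illustrative, not the MDM repository's code.
import numpy as np


def diversity(features: np.ndarray, num_pairs: int = 300) -> float:
    """Average L2 distance between randomly paired generated-motion features.

    features: (N, D) array of latent features for N generated motions; assumes N >= num_pairs.
    """
    n = features.shape[0]
    first = np.random.choice(n, num_pairs, replace=False)
    second = np.random.choice(n, num_pairs, replace=False)
    return float(np.linalg.norm(features[first] - features[second], axis=1).mean())


def multimodality(per_prompt_features: np.ndarray) -> float:
    """Average pairwise distance among repeated generations for the same prompt.

    per_prompt_features: (P, R, D) array with R generations for each of P prompts.
    """
    p, r, _ = per_prompt_features.shape
    per_prompt = []
    for i in range(p):
        # All pairwise L2 distances among the R repetitions for prompt i.
        diffs = per_prompt_features[i, :, None, :] - per_prompt_features[i, None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        # Exclude the zero diagonal (distance of a sample to itself).
        per_prompt.append(dists[~np.eye(r, dtype=bool)].mean())
    return float(np.mean(per_prompt))
```

FID and R-precision additionally require ground-truth feature distributions and paired text embeddings, so they are omitted from this sketch.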
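
The Experiment Setup row lists concrete hyperparameters. The following is a hedged PyTorch sketch that instantiates a transformer encoder with the reported values (8 layers, 4 attention heads, latent dimension 512, feed-forward size 1024, dropout 0.1, GELU activations) alongside the reported learning rate and a T = 1000 cosine noise schedule; the variable names, the optimizer choice, and the schedule implementation are assumptions for illustration, not the published MDM code.

```python
# Hedged sketch of the reported training configuration; not the published MDM code.
import math
import torch
import torch.nn as nn

LATENT_DIM = 512        # latent dimension d
NUM_LAYERS = 8          # transformer layers
NUM_HEADS = 4           # attention heads
FF_SIZE = 1024          # feed-forward size
DROPOUT = 0.1
BATCH_SIZE = 64
LEARNING_RATE = 1e-4
DIFFUSION_STEPS = 1000  # T noising steps

# Transformer-encoder backbone matching the reported hyperparameters.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=LATENT_DIM,
    nhead=NUM_HEADS,
    dim_feedforward=FF_SIZE,
    dropout=DROPOUT,
    activation="gelu",
)
denoiser = nn.TransformerEncoder(encoder_layer, num_layers=NUM_LAYERS)

# Optimizer at the reported learning rate (the optimizer type is an assumption).
optimizer = torch.optim.Adam(denoiser.parameters(), lr=LEARNING_RATE)


def cosine_betas(timesteps: int = DIFFUSION_STEPS, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule (Nichol & Dhariwal, 2021) over T noising steps."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    alphas_cumprod = torch.cos((steps / timesteps + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(max=0.999)
```

The classifier-free guidance scale s = 2.5 quoted above is applied only at sampling time, by extrapolating between the conditioned and unconditioned model outputs, and is therefore not part of this training-time configuration.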