MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling
Authors: Weihao Yuan, Yisheng HE, Weichao Shen, Yuan Dong, Xiaodong Gu, Zilong Dong, Liefeng Bo, Qixing Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a 26.6% decrease of FID on HumanML3D and a 29.9% decrease on KIT-ML. |
| Researcher Affiliation | Collaboration | 1 Alibaba Group 2 The University of Texas at Austin |
| Pseudocode | No | The paper describes the methodology using text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The code is not included for now. But we will release the code to the public soon. |
| Open Datasets | Yes | We evaluate our text-to-motion model on HumanML3D [5] and KIT-ML [55] datasets. |
| Dataset Splits | Yes | Following previous methods [5], 23384/1460/4383 samples are used for train/validation/test in HumanML3D, and 4888/300/830 are used for train/validation/test in KIT-ML. (These split sizes are collected in the first sketch below the table.) |
| Hardware Specification | Yes | Our framework is trained on two NVIDIA A100 GPUs with PyTorch. |
| Software Dependencies | No | Our framework is trained on two NVIDIA A100 GPUs with PyTorch. (Only PyTorch is mentioned, without a specific version number or other versioned libraries.) |
| Experiment Setup | Yes | The batch size is set to 256 and the learning rate is set to 2e-4. To quantize the motion data into our 2D structure, we restructure the pose in the datasets to a joint-based format, with the size of 12 × J. The data is then represented by the joint VQ codebook comprised of 256 codes, each with a dimension of 1024. ... The number of residual layers is set to 5... The transformers in our model are all set to have 6 layers, 6 heads, and 384 latent dimensions. The parameter α is set to 1 and N is set to 10. (These hyperparameters are collected in the configuration sketch below the table.) |
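
The split counts quoted in the Dataset Splits row can be sanity-checked against a local copy of the data. The sketch below is not the authors' code: the split-file layout (`train.txt`, `val.txt`, `test.txt`, one sample ID per line) is an assumption about how the HumanML3D and KIT-ML index files are commonly distributed; only the expected counts come from the paper.

```python
# Minimal sketch for checking the reported train/val/test split sizes.
# The file layout ("train.txt" etc., one sample ID per line) is an assumption,
# not part of the paper; the expected counts are the ones reported above.
from pathlib import Path

REPORTED_SPLITS = {
    "HumanML3D": {"train": 23384, "val": 1460, "test": 4383},
    "KIT-ML": {"train": 4888, "val": 300, "test": 830},
}

def check_split_sizes(dataset_root: str, dataset_name: str) -> None:
    """Compare the number of IDs in each split file with the paper's counts."""
    for split, n_expected in REPORTED_SPLITS[dataset_name].items():
        index_file = Path(dataset_root) / f"{split}.txt"  # hypothetical layout
        n_found = sum(1 for line in index_file.read_text().splitlines() if line.strip())
        status = "OK" if n_found == n_expected else "MISMATCH"
        print(f"{dataset_name} {split}: expected {n_expected}, found {n_found} [{status}]")

# Example usage (paths are placeholders):
# check_split_sizes("data/HumanML3D", "HumanML3D")
# check_split_sizes("data/KIT-ML", "KIT-ML")
```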
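
For readers re-implementing the setup, the values quoted in the Experiment Setup row can be gathered into a single configuration object. This is a minimal sketch, not the released configuration: the field names and the exact role of N are assumptions, while the values themselves are the ones reported in the paper.

```python
# Minimal sketch of the reported hyperparameters; field names are assumptions.
from dataclasses import dataclass

@dataclass
class MoGenTSConfig:
    # Optimization
    batch_size: int = 256
    learning_rate: float = 2e-4
    # Spatial-temporal joint quantization
    joint_codebook_size: int = 256   # joint VQ codebook comprised of 256 codes
    joint_code_dim: int = 1024       # dimension of each code
    num_residual_layers: int = 5     # residual quantization layers
    # Transformers (all transformers in the model share these sizes)
    num_layers: int = 6
    num_heads: int = 6
    latent_dim: int = 384
    # Other reported parameters
    alpha: float = 1.0               # the paper's α
    n_steps: int = 10                # the paper's N (exact role assumed here)

config = MoGenTSConfig()  # defaults mirror the values quoted in the table
```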