MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling

Authors: Weihao Yuan, Yisheng He, Weichao Shen, Yuan Dong, Xiaodong Gu, Zilong Dong, Liefeng Bo, Qixing Huang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, with a 26.6% decrease of FID on HumanML3D and a 29.9% decrease on KIT-ML.
Researcher Affiliation | Collaboration | 1 Alibaba Group, 2 The University of Texas at Austin
Pseudocode | No | The paper describes the methodology using text and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The code is not included for now, but we will release the code to the public soon.
Open Datasets | Yes | We evaluate our text-to-motion model on HumanML3D [5] and KIT-ML [55] datasets.
Dataset Splits | Yes | Following previous methods [5], 23384/1460/4383 samples are used for train/validation/test in HumanML3D, and 4888/300/830 are used for train/validation/test in KIT-ML. (See the split-count sketch after the table.)
Hardware Specification | Yes | Our framework is trained on two NVIDIA A100 GPUs with PyTorch.
Software Dependencies | No | Our framework is trained on two NVIDIA A100 GPUs with PyTorch. Only PyTorch is mentioned, without a specific version number or other versioned libraries.
Experiment Setup | Yes | The batch size is set to 256 and the learning rate is set to 2e-4. To quantize the motion data into our 2D structure, we restructure the pose in the datasets to a joint-based format, with a size of 12 × J. The data is then represented by the joint VQ codebook comprised of 256 codes, each with a dimension of 1024. ... The number of residual layers is set to 5 ... The transformers in our model are all set to have 6 layers, 6 heads, and 384 latent dimensions. The parameter α is set to 1 and N is set to 10. (A hedged configuration sketch follows the table.)
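The split counts quoted in the Dataset Splits row can be written down as a small lookup that a reproduction could check against. This is a minimal Python sketch only; the dictionary keys and the check_split_sizes helper are names introduced here, not part of the authors' (unreleased) code.

```python
# Split counts quoted in the Dataset Splits row above.
# Illustrative sketch: the key names and helper are hypothetical.
SPLIT_COUNTS = {
    "HumanML3D": {"train": 23384, "validation": 1460, "test": 4383},
    "KIT-ML": {"train": 4888, "validation": 300, "test": 830},
}

def check_split_sizes(dataset_name: str, splits: dict) -> None:
    """Verify that loaded splits match the counts reported in the paper."""
    expected = SPLIT_COUNTS[dataset_name]
    for split_name, expected_count in expected.items():
        actual = len(splits[split_name])
        if actual != expected_count:
            raise ValueError(
                f"{dataset_name} {split_name}: expected {expected_count} samples, got {actual}"
            )
```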
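The Experiment Setup row lists the main hyperparameters. Below is a hedged sketch of how they might be collected into a single training configuration. The class and field names (MoGenTSConfig, joint_feature_dim, and so on) are assumptions made for this report, not the authors' released code, and the roles of α and N are stated only as quoted.

```python
from dataclasses import dataclass

@dataclass
class MoGenTSConfig:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are assumptions made for this sketch; the authors' code
    has not been released, so the exact structure may differ.
    """
    batch_size: int = 256
    learning_rate: float = 2e-4
    joint_feature_dim: int = 12      # pose restructured to a joint-based format of size 12 x J
    joint_codebook_size: int = 256   # joint VQ codebook with 256 codes
    code_dim: int = 1024             # each code has dimension 1024
    num_residual_layers: int = 5     # number of residual layers
    transformer_layers: int = 6
    transformer_heads: int = 6
    transformer_latent_dim: int = 384
    alpha: float = 1.0               # the parameter alpha quoted in the paper
    n_param: int = 10                # the parameter N quoted in the paper

cfg = MoGenTSConfig()
```

A reproduction would still need details the quoted excerpt does not give, such as the optimizer, training schedule, and the exact loss terms weighted by α, so this sketch only pins down the values reported above.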