MotionGPT: Human Motion as a Foreign Language

Authors: Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, Tao Chen

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
Researcher Affiliation | Collaboration | Fudan University; Tencent; ShanghaiTech University
Pseudocode | No | The paper describes methods in text and diagrams (Figure 2, Figure 3) but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/OpenMotionLab/MotionGPT
Open Datasets | Yes | The study primarily focuses on two text-to-motion datasets: HumanML3D [10] and KIT [31].
Dataset Splits | No | The paper references well-known datasets (HumanML3D [10], KIT [31]) and mentions using part of the AMASS dataset [25], but it does not provide specific details on the train/validation/test splits (e.g., percentages, sample counts, or explicit references to standard splits) needed to reproduce the data partitioning.
Hardware Specification | Yes | Small and Base models are trained on 8 Tesla V100 GPUs, while Large models are trained on 64 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions T5 as the language-model backbone and the AdamW optimizer, but it does not provide version numbers for software components or libraries (e.g., Python version, or deep-learning framework versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | We set the codebook of the motion tokenizer as K ∈ R^(512×512) for most comparisons. The motion encoder E incorporates a temporal downsampling rate l of 4. We utilize T5 [36] as the underlying architecture for our language model, with a baseline model consisting of 12 layers in both the transformer encoder and decoder. The feed-forward networks have an output dimensionality of d_ff = 3072, and the attention mechanisms employ an inner dimensionality of d_kv = 64. The remaining sub-layers and embeddings have a dimensionality of d_model = 768. Moreover, all our models employ the AdamW optimizer for training. The motion tokenizers are trained using a 10^-4 learning rate and a 256 mini-batch size, while our language models use a 2×10^-4 learning rate for the pre-training stage, 10^-4 for the instruction-tuning stage, and a 16 mini-batch size for both stages. The motion tokenizer undergoes 150K iterations of training, while the language model undergoes 300K iterations during the pre-training stage and another 300K iterations during the instruction-tuning stage.
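As a rough aid to reproduction, the hyperparameters reported above can be gathered into a single configuration sketch. The snippet below is a minimal illustration only, assuming a HuggingFace T5 backbone and PyTorch's AdamW; the names MOTION_TOKENIZER_CFG, LM_TRAINING_CFG, and make_optimizer are hypothetical and are not taken from the paper or the released code.

```python
# Sketch of the reported setup; numeric values come from the paper, names are illustrative.
from transformers import T5Config
from torch.optim import AdamW

# Motion tokenizer: codebook K with 512 codes of dimension 512, temporal downsampling l = 4.
MOTION_TOKENIZER_CFG = {
    "codebook_size": 512,
    "code_dim": 512,
    "temporal_downsample": 4,
    "learning_rate": 1e-4,
    "batch_size": 256,
    "iterations": 150_000,
}

# Language-model backbone: T5 with 12 encoder and 12 decoder layers,
# d_model = 768, d_ff = 3072, d_kv = 64 (the T5-base scale).
t5_config = T5Config(
    num_layers=12,
    num_decoder_layers=12,
    d_model=768,
    d_ff=3072,
    d_kv=64,
)

# Two-stage language-model training: pre-training, then instruction tuning.
LM_TRAINING_CFG = {
    "pretrain": {"learning_rate": 2e-4, "batch_size": 16, "iterations": 300_000},
    "instruction_tuning": {"learning_rate": 1e-4, "batch_size": 16, "iterations": 300_000},
}

def make_optimizer(model, lr):
    """AdamW, as stated in the paper; other optimizer arguments are left at defaults."""
    return AdamW(model.parameters(), lr=lr)
```

Note that the paper does not specify AdamW betas, weight decay, or learning-rate schedules, so those are left at library defaults in this sketch.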