AMD: Autoregressive Motion Diffusion
Authors: Bo Han, Hao Peng, Minjing Dong, Yi Ren, Yixuan Shen, Chang Xu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments, Datasets, and Evaluation Metrics: HumanLong3D: We collected motion data using motion capture equipment and online sources and annotated each motion sequence with various semantic labels to create the HumanLong3D dataset. ... Our proposed AMD achieves impressive performances on the HumanML3D, HumanLong3D, AIST++, and HumanMusic datasets, which highlights its ability to generate high-fidelity motion given different modality inputs. |
| Researcher Affiliation | Collaboration | Bo Han1, Hao Peng2, Minjing Dong3, Yi Ren1, Yixuan Shen4, Chang Xu3* 1College of Computer Science and Technology, Zhejiang University; 2Unity China; 3School of Computer Science, Faculty of Engineering, The University of Sydney; 4National University of Singapore |
| Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | The codes for AMD and demos can be found in the Supplementary Materials. |
| Open Datasets | Yes | HumanML3D: The dataset involves the textual reannotation of motion capture data from AMASS (Mahmood et al. 2019) and HumanAct12 (Guo et al. 2020), comprising 14,616 motions annotated with 44,970 textual descriptions. ... AIST++: This dataset (Li et al. 2021) comprises 992 high-quality 3D pose sequences in SMPL format (Loper et al. 2015), captured at 60 FPS, with 952 sequences designated for training and 40 for evaluation. |
| Dataset Splits | Yes | Since the HumanML3D dataset does not contain motion coherence information, we conducted this experiment only on the HumanLong3D dataset, and we divided the dataset into training, test, and validation sets using a ratio of 0.85:0.10:0.05. (A split sketch appears below the table.) |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments are provided in the paper. Only general training parameters are mentioned. |
| Software Dependencies | No | The paper mentions software like the 'CLIP model' and the 'Librosa' toolbox but does not provide specific version numbers for these or other key software dependencies required for replication. |
| Experiment Setup | Yes | Implementation Details. Motion Representation: Our motion representation adopts the same format as HumanML3D, i.e., X ∈ ℝ^{263×F}. Each frame of motion is 263-dimensional data, including the position, linear velocity, angular velocity, and joint-space rotation of three-dimensional human joints, plus label information for judging whether the foot joints are still. ... Motion Duration Prediction Network: L_min is set to 10 and L_max to 50; each unit increment corresponds to 4 motion frames, i.e., 0.2 s of motion duration, so the duration prediction range covers a lower bound of 2 s and an upper bound of 9.8 s over the data samples. The motion duration prediction network is pretrained and is used only during inference. AMD Module: We set the maximum noise scale T to 1000; the coefficients β_{1:T} increase linearly from 10^-4 to 0.02; latent vector dimensions are 512; the motion encoder has 6 layers; the multi-head attention mechanism uses 6 heads; the learning rate is fixed at 10^-4; the number of training steps is 200,000; and we use the AdamW optimizer. Other Settings: The output dimension of the motion linear layer and the latent vector dimension of the AMD module are both 512. The semantic conditional encoder adopts the CLIP ViT-B/32 checkpoint. During inference, the semantic prompt S_i is input into the motion duration prediction network E_D to obtain the estimated duration F̂_i of the motion sequence, which determines the timing dimension for motion sequence sampling. (Hedged sketches of these settings appear below the table.) |
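
The Dataset Splits row reports a 0.85:0.10:0.05 train/test/validation ratio for HumanLong3D but not the splitting procedure itself. Below is a minimal sketch of one way to reproduce such a split; the `split_dataset` helper, the shuffling step, and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def split_dataset(ids, ratios=(0.85, 0.10, 0.05), seed=0):
    """Partition sequence IDs into train/test/validation sets by ratio.

    Hypothetical helper: the paper states only the 0.85:0.10:0.05 ratio,
    so the shuffle and seed here are illustrative assumptions.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for repeatability
    n_train = int(ratios[0] * len(ids))
    n_test = int(ratios[1] * len(ids))
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]      # remainder goes to validation
    return train, test, val
```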
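
The Experiment Setup row specifies the duration prediction range: L_min = 10, L_max = 50, with each unit worth 4 frames (0.2 s). A small sketch of that unit-to-frame mapping follows; the 20 FPS figure is inferred from 4 frames per 0.2 s, and the clamping behavior is an assumption rather than a documented detail.

```python
# Constants taken from the paper's duration prediction settings.
L_MIN, L_MAX = 10, 50   # predicted duration in units
FRAMES_PER_UNIT = 4     # each unit increment corresponds to 4 motion frames
FPS = 20                # assumption: 4 frames per 0.2 s implies 20 FPS data

def units_to_frames(units: int) -> int:
    """Convert a predicted unit count to a frame count, clamped to the valid range."""
    units = max(L_MIN, min(L_MAX, units))  # clamping is an illustrative assumption
    return units * FRAMES_PER_UNIT

def units_to_seconds(units: int) -> float:
    return units_to_frames(units) / FPS

# units_to_seconds(L_MIN) == 2.0, matching the paper's stated lower bound.
```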
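
The AMD module settings (T = 1000, β_{1:T} linear from 10^-4 to 0.02) describe a standard DDPM-style noise schedule. The sketch below shows what that schedule implies for forward sampling; it is generic DDPM machinery under the paper's stated hyperparameters, with AMD's autoregressive conditioning omitted.

```python
import numpy as np

T = 1000                             # maximum noise scale from the paper
betas = np.linspace(1e-4, 0.02, T)   # linear beta_{1:T}
alpha_bars = np.cumprod(1.0 - betas) # cumulative product \bar{alpha}_t

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).

    Standard DDPM forward sampling; x0 is a (263, F) motion array in the
    paper's representation, and t indexes the noise scale in [0, T).
    """
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
```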
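
Finally, the semantic conditional encoder is reported as the CLIP ViT-B/32 checkpoint. A minimal sketch of encoding a text prompt with that checkpoint follows, using the Hugging Face `transformers` wrapper as an assumed stand-in for whichever CLIP distribution the authors actually used.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# "openai/clip-vit-base-patch32" is the ViT-B/32 checkpoint on the HF hub.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Encode a semantic prompt S_i into a 512-d latent, matching the
    512-d latent dimension reported in the Experiment Setup row."""
    tokens = tokenizer(prompt, padding=True, return_tensors="pt")
    return text_encoder(**tokens).pooler_output  # shape (1, 512)
```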