Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model
Authors: Zhenyu Xie, Yang Wu, Xuehao Gao, Zhongqian Sun, Wei Yang, Xiaodan Liang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity. |
| Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; 2 Tencent AI Lab, Shenzhen, China; 3 Xi'an Jiaotong University, Xi'an, China; 4 Dark Matter AI Research, Beijing, China |
| Pseudocode | Yes | Algorithm 1: Reverse Diffusion Process of B2A-HDM (a hedged code sketch of this reverse process appears after the table). |
| Open Source Code | No | The paper contains no statement that the authors are releasing code for this work, and it provides no link to a source-code repository. |
| Open Datasets | Yes | Our experiments are conducted on two publicly available benchmarks for text-to-motion synthesis, namely KIT-ML (Plappert, Mandery, and Asfour 2016) and HumanML3D (Guo et al. 2022). |
| Dataset Splits | No | The paper mentions training for a set number of epochs and reporting performance on an 'evaluation set', but it does not specify training/validation/test splits (percentages, absolute sample counts, or a predefined split methodology) needed for reproducibility. |
| Hardware Specification | Yes | Our B2A-HDM is implemented using PyTorch (Paszke et al. 2019), and both the motion VAE and the diffusion denoiser are trained on 4 Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch and a pre-trained CLIP text encoder, but it does not provide version numbers for these components or for any other libraries needed to replicate the experiments. |
| Experiment Setup | Yes | The dimensions of the latent space for BDM and ADM are 4×256 and 8×256, respectively. ADM is equipped with 2 denoisers. ... The batch size on each GPU is set to 96, and all modules are trained with the AdamW (Loshchilov and Hutter 2019) optimizer at a fixed learning rate of 1e-4. For HumanML3D, both VAE and denoiser are trained for 6,000 epochs, while for KIT-ML, the VAE and denoiser are trained for 25,000 epochs and 2,500 epochs, respectively. (A minimal training-loop sketch mirroring this setup appears after the table.) |
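
The paper's pseudocode (Algorithm 1) specifies a reverse diffusion process that first denoises in the low-dimensional basic latent space (BDM) and then refines in the higher-dimensional advanced space (ADM). The sketch below is a minimal PyTorch rendering under standard DDPM ancestral sampling; the denoiser call signature, the `to_advanced` projection between the 4×256 and 8×256 latent spaces, and the hand-off between stages are assumptions for illustration, not the authors' released code.

```python
import torch

@torch.no_grad()
def reverse_stage(z, denoiser, text_emb, betas):
    """DDPM ancestral sampling over all timesteps with one denoiser."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    for t in reversed(range(len(betas))):
        # Hypothetical denoiser interface: predicts the injected noise.
        eps = denoiser(z, torch.tensor([t]), text_emb)
        mean = (z - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise
    return z

@torch.no_grad()
def b2a_reverse_diffusion(text_emb, basic_denoiser, advanced_denoisers,
                          to_advanced, betas):
    # Basic stage: coarse motion latent in the 4x256 space (BDM).
    z = torch.randn(1, 4, 256)
    z = reverse_stage(z, basic_denoiser, text_emb, betas)
    # Assumed projection into the 8x256 advanced space; the paper's exact
    # hand-off (e.g., whether the projected latent is partially re-noised)
    # is not reproduced here.
    z = to_advanced(z)
    # Advanced stage: the paper equips ADM with 2 denoisers.
    for denoiser in advanced_denoisers:
        z = reverse_stage(z, denoiser, text_emb, betas)
    return z  # decode with the motion VAE to obtain the final motion
```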
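
The reported optimization details (AdamW, fixed learning rate 1e-4, batch size 96 per GPU) translate directly into a training loop. The sketch below is a runnable toy version under heavy assumptions: an MLP stands in for the paper's denoiser, random tensors stand in for VAE-encoded motion latents, the number of diffusion steps is assumed, and text conditioning is omitted entirely.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class ToyDenoiser(nn.Module):
    """Placeholder for the paper's denoiser; only the shapes are meaningful."""
    def __init__(self, latent_shape=(8, 256)):
        super().__init__()
        dim = latent_shape[0] * latent_shape[1]
        self.latent_shape = latent_shape
        self.net = nn.Sequential(nn.Linear(dim + 1, 1024), nn.SiLU(),
                                 nn.Linear(1024, dim))

    def forward(self, z, t):
        x = torch.cat([z.flatten(1), t.float().unsqueeze(1)], dim=1)
        return self.net(x).view(-1, *self.latent_shape)

denoiser = ToyDenoiser()
optimizer = AdamW(denoiser.parameters(), lr=1e-4)   # fixed LR, as reported
latents = torch.randn(960, 8, 256)                  # stand-in VAE latents
loader = DataLoader(TensorDataset(latents), batch_size=96, shuffle=True)

T = 1000                                            # assumed diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

for epoch in range(2):                  # paper: 6,000 epochs on HumanML3D
    for (z0,) in loader:
        t = torch.randint(0, T, (z0.size(0),))
        noise = torch.randn_like(z0)
        ab = alpha_bar[t].view(-1, 1, 1)
        zt = ab.sqrt() * z0 + (1.0 - ab).sqrt() * noise   # forward noising
        loss = nn.functional.mse_loss(denoiser(zt, t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```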