Towards Detailed Text-to-Motion Synthesis via Basic-to-Advanced Hierarchical Diffusion Model

Authors: Zhenyu Xie, Yang Wu, Xuehao Gao, Zhongqian Sun, Wei Yang, Xiaodan Liang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative and qualitative experiment results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM can outperform existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.
Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; 2 Tencent AI Lab, Shenzhen, China; 3 Xi'an Jiaotong University, Xi'an, China; 4 DarkMatter AI Research, Beijing, China
Pseudocode | Yes | Algorithm 1: Reverse Diffusion Process of B2A-HDM (a hedged sketch follows this table)
Open Source Code | No | The paper contains no statement that the authors are releasing the code for the work described in this paper, nor a direct link to a source-code repository.
Open Datasets | Yes | Our experiments are conducted on two publicly available benchmarks for text-to-motion synthesis, namely KIT-ML (Plappert, Mandery, and Asfour 2016) and HumanML3D (Guo et al. 2022).
Dataset Splits | No | The paper mentions training for a certain number of epochs and observing performance on an "evaluation set", but it does not specify explicit training/validation/test splits (percentages, absolute sample counts, or a predefined split methodology) needed for reproducibility.
Hardware Specification | Yes | Our B2A-HDM is implemented using PyTorch (Paszke et al. 2019) and both the motion VAE and the diffusion denoiser are trained on 4 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions PyTorch and a pre-trained CLIP text encoder, but it does not provide version numbers for these or for any other libraries needed to replicate the experiments.
Experiment Setup | Yes | The dimensions of the latent space for BDM and ADM are 4 × 256 and 8 × 256, respectively. ADM is equipped with 2 denoisers. ... The batch size on each GPU is set to 96 and all modules are trained using the AdamW (Loshchilov and Hutter 2019) optimizer with a fixed learning rate of 1e-4. For HumanML3D, both the VAE and the denoiser are trained for 6,000 epochs, while for KIT-ML, the VAE and the denoiser are trained for 25,000 and 2,500 epochs, respectively. (A training-setup sketch follows this table.)
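
The paper's Algorithm 1 is named but not reproduced in this report. Below is a minimal sketch of one plausible basic-to-advanced reverse diffusion pass, built only from the details in the table (a 4 × 256 BDM latent, an 8 × 256 ADM latent, 2 ADM denoisers). The hand-off timestep t_switch, the projection between latent spaces, the scheduler interface, and all function signatures are assumptions, not the authors' API.

```python
import torch

@torch.no_grad()
def b2a_reverse_diffusion(bdm_denoiser, adm_denoisers, scheduler, project,
                          text_emb, num_steps=1000, t_switch=500):
    """Hedged sketch of a basic-to-advanced reverse diffusion pass.

    Assumed flow: the basic diffusion model (BDM) denoises a low-dimensional
    latent to capture the coarse motion, which is then projected into the
    advanced diffusion model's (ADM) latent space and refined by its two
    denoisers over the remaining timesteps.
    """
    z = torch.randn(1, 4, 256)                      # BDM latent (4 x 256, per the paper)
    for t in reversed(range(t_switch, num_steps)):  # coarse denoising stage
        eps = bdm_denoiser(z, t, text_emb)
        z = scheduler.step(eps, t, z)               # one DDPM/DDIM-style update

    z = project(z)                                  # assumed map to the 8 x 256 ADM latent
    for t in reversed(range(t_switch)):             # detail-refinement stage
        denoiser = adm_denoisers[0] if t >= t_switch // 2 else adm_denoisers[1]
        eps = denoiser(z, t, text_emb)
        z = scheduler.step(eps, t, z)
    return z  # decode with the motion VAE to obtain the motion sequence
```

Splitting the ADM timesteps evenly between the two denoisers is likewise an assumption; the paper states only that ADM is equipped with 2 denoisers.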
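
The reported training configuration (AdamW, fixed learning rate 1e-4, batch size 96 per GPU across 4 Tesla V100s, per-benchmark epoch counts) can be summarized in PyTorch roughly as follows. Only those hyperparameters come from the paper; the model, dataset, and loss below are illustrative stand-ins.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: the real model is the B2A-HDM denoiser and the data are
# HumanML3D / KIT-ML motion latents; neither is specified in this report.
model = torch.nn.Linear(256, 256)
dataset = TensorDataset(torch.randn(1024, 256))

# From the paper: AdamW (Loshchilov and Hutter 2019), fixed LR 1e-4,
# batch size 96 per GPU (4 GPUs in total).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(dataset, batch_size=96, shuffle=True)

def diffusion_loss(model, x0):
    """Standard epsilon-prediction objective as a stand-in for the paper's loss."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0), 1)       # continuous timestep proxy, illustrative only
    x_t = (1 - t) * x0 + t * noise      # simplistic linear noising, not the paper's schedule
    return torch.nn.functional.mse_loss(model(x_t), noise)

num_epochs = 6000  # HumanML3D; KIT-ML: 25,000 (VAE) and 2,500 (denoiser) epochs
for epoch in range(num_epochs):
    for (x0,) in loader:
        loss = diffusion_loss(model, x0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```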