HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

Authors: Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes; it outperforms the baselines on various evaluation metrics. We benchmark our proposed task, language-conditioned human motion generation in 3D scenes, on HUMANISE and describe the detailed settings, baselines, analyses, and ablative studies.
Researcher Affiliation | Collaboration | 1) School of Computer Science & Technology, Beijing Institute of Technology; 2) Beijing Institute for General Artificial Intelligence (BIGAI); 3) Institute for Artificial Intelligence, Peking University; 4) Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing
Pseudocode | No | The paper describes the model architecture and training process in text and diagrams (Fig. 3) but does not provide formal pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (https://silverster98.github.io/HUMANISE/) but does not explicitly state that source code for the methodology is released, nor does it provide a direct link to a code repository.
Open Datasets | Yes | To tackle the above issues, we propose a large-scale and semantic-rich synthetic HSI dataset, HUMANISE (see Fig. 1), by aligning the captured human motion sequences [Mahmood et al., 2019] with the scanned indoor scenes [Dai et al., 2017].
Dataset Splits | Yes | We split motions in HUMANISE according to the original scene IDs and split in ScanNet [Dai et al., 2017], resulting in 16.5k motions in 543 scenes for training and 3.1k motions in 100 scenes for testing. (A minimal split sketch follows the table.)
Hardware Specification | Yes | We train our model with a batch size of 32 on a V100 GPU.
Software Dependencies | No | The paper mentions using Adam and pre-trained BERT but does not provide specific version numbers for these software components or other libraries.
Experiment Setup | Yes | We train our generative model on HUMANISE for 150 epochs using Adam [Kingma and Ba, 2014] and a fixed learning rate of 0.0001. For hyper-parameters, we empirically set α_kl = α_o = 0.1, α_a = 0.5, α_r = 1.0, and α_p = α_v = 10.0. We set the dimension of the global condition latent z_c to 512 and of the latent z to 256. The hidden state size is set to 256 in the single-layer bidirectional GRU motion encoder. The transformer motion decoder contains two standard layers with a hidden state size of 512. We train our model with a batch size of 32 on a V100 GPU. (A PyTorch sketch of this configuration follows the table.)
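The dataset split quoted above is scene-level, following ScanNet's own scene IDs. Below is a minimal sketch of that partitioning, assuming each motion record carries the `scene_id` of the ScanNet scene it was synthesized in; the record layout and the helper name `split_by_scene` are hypothetical, not from the paper:

```python
from collections import defaultdict

def split_by_scene(motions, train_scene_ids, test_scene_ids):
    """Partition HUMANISE motions by their ScanNet scene ID, mirroring the
    paper's scene-level split (16.5k motions / 543 scenes for training,
    3.1k motions / 100 scenes for testing)."""
    splits = defaultdict(list)
    for motion in motions:
        if motion["scene_id"] in train_scene_ids:
            splits["train"].append(motion)
        elif motion["scene_id"] in test_scene_ids:
            splits["test"].append(motion)
    return splits
```

Because the split is by scene rather than by motion, no test-scene geometry is seen during training.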
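To make the quoted experiment setup concrete, here is a minimal PyTorch sketch that wires the reported numbers together. Only the values (latent dimensions, hidden sizes, layer counts, loss weights, optimizer settings) come from the paper; the module names, `pose_dim`, the number of attention heads, and the conditioning scheme are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Values quoted in the Experiment Setup row; everything else is a placeholder.
Z_COND_DIM = 512   # global condition latent z_c
Z_DIM = 256        # CVAE latent z
GRU_HIDDEN = 256   # single-layer bidirectional GRU motion encoder
DEC_HIDDEN = 512   # transformer motion decoder, two standard layers
LOSS_WEIGHTS = {"kl": 0.1, "o": 0.1, "a": 0.5, "r": 1.0, "p": 10.0, "v": 10.0}

class MotionEncoder(nn.Module):
    """Encodes a motion sequence into CVAE latent parameters (mu, logvar)."""
    def __init__(self, pose_dim: int):
        super().__init__()
        self.gru = nn.GRU(pose_dim, GRU_HIDDEN, num_layers=1,
                          bidirectional=True, batch_first=True)
        self.to_mu = nn.Linear(2 * GRU_HIDDEN, Z_DIM)
        self.to_logvar = nn.Linear(2 * GRU_HIDDEN, Z_DIM)

    def forward(self, motion):
        # motion: (B, T, pose_dim); h: (2, B, GRU_HIDDEN) for one bi-layer
        _, h = self.gru(motion)
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.to_mu(h), self.to_logvar(h)

class MotionDecoder(nn.Module):
    """Two-layer transformer decoder conditioned on the fused latent [z, z_c]."""
    def __init__(self, pose_dim: int, nhead: int = 8):  # nhead is assumed
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, DEC_HIDDEN)
        layer = nn.TransformerDecoderLayer(d_model=DEC_HIDDEN, nhead=nhead,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cond_proj = nn.Linear(Z_DIM + Z_COND_DIM, DEC_HIDDEN)
        self.out_proj = nn.Linear(DEC_HIDDEN, pose_dim)

    def forward(self, queries, z, z_c):
        # Fuse the latents into a single-token memory for cross-attention.
        memory = self.cond_proj(torch.cat([z, z_c], dim=-1)).unsqueeze(1)
        return self.out_proj(self.decoder(self.in_proj(queries), memory))

# Training configuration as reported: Adam with a fixed learning rate of
# 1e-4, 150 epochs, batch size 32 on a single V100 GPU.
pose_dim = 63  # placeholder; the actual motion parameterization may differ
encoder, decoder = MotionEncoder(pose_dim), MotionDecoder(pose_dim)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def total_loss(losses):
    # losses maps term names (kl, o, a, r, p, v) to per-batch loss tensors,
    # combined with the empirically chosen weights above.
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())
```

The weighted-sum loss simply reflects the α coefficients quoted in the table; how each individual term is computed is not specified in this section.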