Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
SnapMoGen: Human Motion Generation from Expressive Texts
Authors: Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both HumanML3D and SnapMoGen demonstrate the state-of-the-art performance of MoMask++. Overall, MoMask++ attains state-of-the-art performance on both datasets, demonstrating consistent improvements in motion-text alignment and motion quality. We perform comprehensive ablation experiments to evaluate the effects of various hyper-parameters and technical designs, as shown in Tab. 4 and Tab. 5. |
| Researcher Affiliation | Collaboration | Chuan Guo¹, Inwoo Hwang², Jian Wang¹, Bing Zhou¹. ¹Snap Inc., ²Seoul National University. Project lead: EMAIL; EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations, and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://snap-research.github.io/SnapMoGen/), which is a high-level project overview page. It does not explicitly state that the source code for the methodology described in this paper is available at this link or in supplementary materials. |
| Open Datasets | Yes | In this paper, we introduced SnapMoGen, a high-quality text-motion dataset featuring temporally continuous motion segments with expressive textual annotations. Comprising 20K motion clips and 122K detailed descriptions averaging 48 words each, SnapMoGen provides significantly richer semantic information than existing datasets. We also establish comprehensive benchmarks and evaluation protocols for this new dataset. Table 1: Comparisons with public datasets (columns: dataset, year, # clips, duration, # texts, avg. words per text, avg. length per clip; values for the Mocap? and Continuous? columns are missing). KIT-ML [28]: 2016, 3,911, 10.3 h, 6,278, 8, 9.5 s. BABEL [30]: 2021, 52,937, 33.2 h, 52,937, 2, 2.3 s. HumanML3D [10]: 2022, 14,616, 28.6 h, 44,970, 12, 7.1 s. Motion-X [19]: 2023, 81,084, 144.2 h, 81,084, 9, 6.4 s. SnapMoGen: 2025, 20,450, 43.7 h, 122,565, 48, 7.8 s. SnapMoGen highlights its accurate and expressive text descriptions, high-quality motion capture data, and continuous motion segmentation. One footnote indicates values calculated only from the publicly available BABEL subset. The SnapMoGen text count (122,565) denotes a combination of 40,859 manual text annotations and 81,706 LLM-augmented annotations, both with an average text length of 48 words. Besides SnapMoGen, we also conduct experiments on HumanML3D [10], a popular motion-text dataset comprising 14,616 motions with 44,970 textual descriptions. https://snap-research.github.io/SnapMoGen/ |
| Dataset Splits | Yes | To prevent data leakage, we deliberately hold out a test (10%) set and a validation (5%) set where the motion scenarios (e.g., fashion) differ from the training motions. |
| Hardware Specification | Yes | All models are trained on a single Tesla V100 GPU, with batch size of 256 for VQVAEs and 64 for transformers. |
| Software Dependencies | No | The paper mentions using T5-base models for text encoding but does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Our VQVAE encoder and decoder consists of three dilated res-blocks, with a down(up)-scale factor of 4 [37, 9]. The temporal quantization scales follows the progression [n/2^V, ..., n/2^0] with n denoting the full-scale length. We employ 4 (i.e., V = 3) quantization layers for HumanML3D and 2 for SnapMoGen, with codebook sizes of 512 × 512 and 2048 × 512, respectively. The hyper-parameters β and λ_ess are set to 0.02 and 2.0. Our transformer architecture comprise 8 layers with feedforward size of 1024, latent dimension of 384, 6 attention heads, and a dropout ratio of 0.2, totaling 13.5M parameters for in-context model, and 18.3M parameters for cross-attention model. During inference, we use classifier-free guidance scales of 5 and 4, and iteration counts (L) of 10 and 18 for SnapMoGen and HumanML3D, respectively. All models are trained on a single Tesla V100 GPU, with batch size of 256 for VQVAEs and 64 for transformers. |
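
The Experiment Setup row packs several per-dataset values and the temporal scale progression [n/2^V, ..., n/2^0] into one paragraph. The sketch below restates them in Python as a reading aid, not as the authors' code. The names `SetupSketch` and `temporal_scales` are hypothetical, reading a codebook size such as 512 × 512 as (entries × dimension) is an assumption, and so is the use of integer division for the scale lengths.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SetupSketch:
    """Hypothetical container for values quoted in the Experiment Setup row."""
    dataset: str
    num_quant_layers: int   # number of quantization layers (V + 1)
    codebook_entries: int   # assumed: first factor of the quoted codebook size
    codebook_dim: int       # assumed: second factor of the quoted codebook size
    cfg_scale: float        # classifier-free guidance scale at inference
    inference_iters: int    # iteration count L at inference


# Values as quoted in the table row above.
HUMANML3D = SetupSketch("HumanML3D", 4, 512, 512, 4.0, 18)
SNAPMOGEN = SetupSketch("SnapMoGen", 2, 2048, 512, 5.0, 10)


def temporal_scales(n: int, num_layers: int) -> List[int]:
    """Return the lengths n/2^V, ..., n/2^0 with V = num_layers - 1.

    Integer division is an assumption; the source does not say how
    non-divisible lengths are rounded.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    top = num_layers - 1
    return [n // (2 ** v) for v in range(top, -1, -1)]


if __name__ == "__main__":
    n = 64  # illustrative full-scale length, not taken from the paper
    for cfg in (HUMANML3D, SNAPMOGEN):
        print(cfg.dataset, temporal_scales(n, cfg.num_quant_layers))
```

With the illustrative length n = 64, the four-layer HumanML3D setting yields scales of 8, 16, 32, and 64, while the two-layer SnapMoGen setting yields 32 and 64.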