Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

🎧MOSPA: Human Motion Generation Driven by Spatial Audio

Authors: Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at our website. (...) Extensive evaluations on the SAM demonstrate that MOSPA achieves state-of-the-art performance on this task, outperforming existing baselines in generating realistic and diverse motion responses to spatial audio.
Researcher Affiliation | Collaboration | 1 The University of Hong Kong; 2 Shanghai AI Lab; 3 The Hong Kong University of Science and Technology; 4 Macau University of Science and Technology; 5 ShanghaiTech University; 6 Texas A&M University
Pseudocode | No | The paper describes the methodology in narrative text and uses diagrams (e.g., Figure 5 for MOSPA framework) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Our dataset, code, and models will be publicly released for further research. (...) Our code and model are publicly available at our website. (...) The dataset and the codes will be released after acceptance.
Open Datasets | No | We introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. (...) Our dataset, code, and models will be publicly released for further research. (...) The dataset and the codes will be released after acceptance.
Dataset Splits | Yes | The dataset is split into training, validation, and test sub-datasets at a common ratio of 8:1:1. Consequently, the training sub-dataset comprises 2,400 motion sequences, while the validation and test sub-datasets each contain approximately 300 motion sequences.
Hardware Specification | Yes | The entire training process requires approximately 18 hours on a single RTX 4090 GPU with a batch size of 128.
Software Dependencies | No | The paper mentions using AdamW [55] as an optimizer and references librosa [61] for audio processing but does not provide specific version numbers for the key software components used in their implementation environment (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | The encoder transformer is configured with a latent dimension of 512, 8 heads, and 4 layers. We employ AdamW [55] as the optimizer with an initial value of 1 × 10^-4. The number of denoising steps used is 1000, and the noise schedule is cosine. The training phase concludes after 6,000 epochs. (...) All loss weights (λ) are initialized set to 1. At epoch 5,000 of the total 6,000 training epochs, λtraj and λrot are increased to 3 (...).
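
The numbers quoted in the Dataset Splits and Experiment Setup rows can be turned into a small sanity-check script. What follows is only a minimal sketch and not the authors' code: the function names are hypothetical, the total of 3,000 sequences is inferred from 2,400 training sequences at an 8:1:1 ratio, the exact epoch at which the weights change (and whether epochs are counted from 0 or 1) is an assumption, and the cosine schedule is assumed to be the commonly used variant with offset s = 0.008, which the quoted text does not specify. Only the two loss weights named in the quote are modeled.

```python
import math

# Values quoted in the Dataset Splits and Experiment Setup rows.
TOTAL_EPOCHS = 6000
WEIGHT_SWITCH_EPOCH = 5000
DENOISING_STEPS = 1000
SPLIT_RATIO = (8, 1, 1)


def split_counts(total, ratio=SPLIT_RATIO):
    """Divide `total` sequences by an integer ratio.

    Any rounding remainder is given to the first (training) part.
    """
    unit = sum(ratio)
    counts = [total * r // unit for r in ratio]
    counts[0] += total - sum(counts)
    return counts


def loss_weights(epoch):
    """Return the two named loss weights for a given epoch.

    Both start at 1 and are raised to 3 from epoch 5,000 onward
    (assumption: the switch epoch itself already uses the new value).
    """
    value = 3.0 if epoch >= WEIGHT_SWITCH_EPOCH else 1.0
    return {"traj": value, "rot": value}


def cosine_alpha_bar(t, steps=DENOISING_STEPS, s=0.008):
    """Cumulative signal level at step t under an assumed cosine schedule."""
    def f(u):
        return math.cos((u / steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)
```

Under these assumptions, `split_counts(3000)` returns `[2400, 300, 300]`, which matches the quoted split sizes, and `loss_weights` returns 1.0 for both weights before epoch 5,000 and 3.0 afterwards.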