Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

WaveAR: Wavelet-Aware Continuous Autoregressive Diffusion for Accurate Human Motion Prediction

Authors: Shengchuan Gao, Shuo Wang, Yabiao Wang, Ran Yi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on standard benchmarks demonstrate that our approach delivers more accurate and computationally efficient predictions than prior state-of-the-art methods.
Researcher Affiliation | Collaboration | Shengchuan Gao¹*, Shuo Wang²*, Yabiao Wang²,³, Ran Yi¹; ¹Shanghai Jiao Tong University, ²Tencent Youtu Lab, ³Zhejiang University; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the method using textual descriptions and architectural diagrams (Figure 1, Figure 4) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | No | Answer: [No] Justification: We will open-source the code after the paper is published.
Open Datasets | Yes | We evaluate our method on two widely adopted benchmarks for stochastic human motion prediction (SHMP): Human3.6M [20], HumanEva-I [42] and AMASS [33].
Dataset Splits | Yes | To ensure compatibility with prior studies, we follow the protocol of [6, 43], modeling each pose with a 16-joint skeleton. Given the first 0.5 seconds (25 frames) of observed motion, the task is to forecast the subsequent 2 seconds (100 frames). HumanEva-I provides 3D motion captured at 60 Hz from three actors each performing five distinct movements, with poses encoded as 15-joint skeletons. Following common practice, we use the first 0.25 s (15 frames) of each sequence as input and task our model with forecasting the next 1 s (60 frames) of motion.
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. While Table 2 compares inference times, it does not provide hardware information.
Software Dependencies | No | The paper mentions using the AdamW optimizer [32] and general model architectures (Transformer layers, MLP), but it does not specify exact version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software dependencies.
Experiment Setup | Yes | We employ a lightweight ST-VAE with a two-layer encoder-decoder architecture. Each hidden layer has a dimension of 128, and we apply a temporal downsampling rate of 2. It is trained for 500 epochs with a batch size of 128. ... For the Human3.6M dataset, the diffusion backbone consists of 12 Transformer layers: the first 6 layers each combine self-attention and cross-attention over the wavelet embeddings, while the remaining 6 layers use only self-attention. We set the latent dimension to 256. For HumanEva-I, we use the same overall design but employ only 3 layers with both self- and cross-attention, followed by 3 self-attention layers, also with a latent dimension of 256. The noise prediction network in the diffusion model is a 3-layer MLP with a hidden dimension of 1024. We optimize the model for 200 epochs using the AdamW optimizer [32] with β1 = 0.5, β2 = 0.99, and an initial learning rate of 2 × 10⁻⁴. A multi-step learning-rate scheduler with decay factor γ = 0.9 is applied, and the batch size is increased to 256 to stabilize training.
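
As a rough illustration of the numbers quoted in the Dataset Splits and Experiment Setup rows, the sketch below checks the frame arithmetic and shows how a multi-step learning-rate schedule with γ = 0.9 would decay from 2 × 10⁻⁴ over 200 epochs. It is a plain-Python sketch, not the authors' code. The names `SplitProtocol`, `multistep_lr`, and `MILESTONES` are hypothetical, and the milestone epochs (every 20 epochs) are an assumption made only for illustration, since the quoted text does not state them. The 50 Hz rate for Human3.6M is inferred from "0.5 seconds (25 frames)" rather than stated in the excerpt.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SplitProtocol:
    """Observation/forecast horizon as quoted in the Dataset Splits row."""
    name: str
    observed_seconds: float
    observed_frames: int
    forecast_seconds: float
    forecast_frames: int

    def implied_fps(self) -> float:
        # Frame rate implied by the observation window (frames / seconds).
        return self.observed_frames / self.observed_seconds

    def is_consistent(self) -> bool:
        # The forecast window should imply the same frame rate.
        forecast_fps = self.forecast_frames / self.forecast_seconds
        return abs(forecast_fps - self.implied_fps()) < 1e-9


def multistep_lr(initial_lr: float, gamma: float,
                 milestones: Sequence[int], epoch: int) -> float:
    """Learning rate at `epoch`: multiplied by gamma once per milestone reached."""
    decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** decays


# Values taken from the quoted text.
H36M = SplitProtocol("Human3.6M", 0.5, 25, 2.0, 100)
HUMANEVA = SplitProtocol("HumanEva-I", 0.25, 15, 1.0, 60)
INITIAL_LR = 2e-4
GAMMA = 0.9
EPOCHS = 200

# ASSUMPTION: milestone epochs are not given in the source; every 20 epochs
# is chosen here purely to make the example concrete.
MILESTONES = list(range(20, EPOCHS, 20))

if __name__ == "__main__":
    for split in (H36M, HUMANEVA):
        print(split.name, split.implied_fps(), split.is_consistent())
    for epoch in (0, 20, 100, EPOCHS - 1):
        print(epoch, multistep_lr(INITIAL_LR, GAMMA, MILESTONES, epoch))
```

Under these assumed milestones the rate would be decayed nine times by the final epoch. A different milestone list changes only the `MILESTONES` line.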