Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MotionBind: Multi-Modal Human Motion Alignment for Retrieval, Recognition, and Generation

Authors: Kaleab Kinfu, Rene Vidal

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MotionBind achieves state-of-the-art or competitive performance across motion reconstruction, cross-modal retrieval, zero-shot action recognition, and text-to-motion generation benchmarks. The code is available at: https://github.com/vidal-lab/MotionBind. Extensive experiments on motion reconstruction, cross-modal retrieval, zero-shot action recognition, and text-to-motion synthesis demonstrate that our methods achieve state-of-the-art performance across five benchmarks.
Researcher Affiliation | Academia | Kaleab A. Kinfu, University of Pennsylvania, Philadelphia, PA 19104, USA, EMAIL; René Vidal, University of Pennsylvania, Philadelphia, PA 19104, USA, EMAIL
Pseudocode | No | The paper describes model architectures and training procedures using textual explanations and mathematical equations, but it does not contain explicit pseudocode or algorithm blocks labeled as such.
Open Source Code | Yes | The code is available at: https://github.com/vidal-lab/MotionBind.
Open Datasets | Yes | As outlined in the main paper, we use four publicly available human motion datasets in our experiments: AMASS [18], HumanML3D [16], KIT-ML [17], and AIST++ [19].
Dataset Splits | Yes | For AMASS, which lacks an official split, we constructed a 70/30 train-test split for evaluation. HumanML3D and KIT-ML provide natural language descriptions paired with 3D human motion sequences, making them well-suited for text-to-motion synthesis and cross-modal retrieval tasks.
Hardware Specification | Yes | All experiments were conducted using 8 NVIDIA RTX A5000 GPUs (each with 24 GB of memory) distributed across a single node.
Software Dependencies | No | Model training and inference were implemented in PyTorch and MMCV, leveraging mixed-precision training (via PyTorch AMP) and distributed data parallelism using the NCCL backend for efficiency and scalability. The paper mentions software but does not specify version numbers for reproducibility.
Experiment Setup | Yes | We used the AdamW optimizer with cosine annealing and linear warmup for both MuTMoT and REALM. Gradient clipping with a maximum norm of 1.0 was applied to stabilize training. For MuTMoT, the model was trained for 20 epochs with a batch size of 384 on each GPU, distributed across GPUs with gradient accumulation to ensure stable optimization. REALM was trained for up to 50 epochs using a 1000-step diffusion schedule during training and a reduced 50-step schedule during inference. Typical training time for MuTMoT is approximately 8 hours, while REALM training takes roughly 5 days to converge. The main training hyperparameters are summarized in Table 5.
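
Two of the setup details quoted above, the 70/30 train-test split and the learning-rate schedule with linear warmup followed by cosine annealing, can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the function names, the seed, the base learning rate, and the step counts are hypothetical placeholders chosen for illustration, and the paper's actual hyperparameters (its Table 5) are not reproduced here.

```python
import math
import random


def split_70_30(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` deterministically, then cut it into train/test.

    The seed and the shuffle-then-cut strategy are assumptions; the summary
    only states that a 70/30 split was constructed.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup to base_lr, then cosine decay.

    All arguments are hypothetical values supplied by the caller.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    train, test = split_70_30(range(10))
    print(len(train), len(test))
    for s in (0, 9, 10, 55, 100):
        print(s, warmup_cosine_lr(s, total_steps=100, warmup_steps=10, base_lr=1e-3))
```

In a PyTorch training loop, a value like this would typically be written into the optimizer's parameter groups once per step, with gradient clipping at a maximum norm of 1.0 applied before each optimizer update, as the quoted setup describes.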