Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MotionBind: Multi-Modal Human Motion Alignment for Retrieval, Recognition, and Generation

Authors: Kaleab Kinfu, Rene Vidal

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	MotionBind achieves state-of-the-art or competitive performance across motion reconstruction, cross-modal retrieval, zero-shot action recognition, and text-to-motion generation benchmarks. The code is available at: https://github.com/vidal-lab/MotionBind. Extensive experiments on motion reconstruction, cross-modal retrieval, zero-shot action recognition, and text-to-motion synthesis demonstrate that our methods achieve state-of-the-art performance across five benchmarks.
Researcher Affiliation	Academia	Kaleab A. Kinfu University of Pennsylvania Philadelphia, PA 19104, USA EMAIL René Vidal University of Pennsylvania Philadelphia, PA 19104, USA EMAIL
Pseudocode	No	The paper describes model architectures and training procedures using textual explanations and mathematical equations, but it does not contain explicit pseudocode or algorithm blocks labeled as such.
Open Source Code	Yes	The code is available at: https://github.com/vidal-lab/MotionBind.
Open Datasets	Yes	As outlined in the main paper, we use four publicly available human motion datasets in our experiments: AMASS [18], HumanML3D [16], KIT-ML [17], and AIST++ [19].
Dataset Splits	Yes	For AMASS, which lacks an official split, we constructed a 70/30 train-test split for evaluation. HumanML3D and KIT-ML provide natural language descriptions paired with 3D human motion sequences, making them well-suited for text-to-motion synthesis and cross-modal retrieval tasks.
Hardware Specification	Yes	All experiments were conducted using 8 NVIDIA RTX A5000 GPUs (each with 24 GB of memory) distributed across a single node.
Software Dependencies	No	Model training and inference were implemented in PyTorch and MMCV, leveraging mixed-precision training (via PyTorch AMP) and distributed data parallelism using the NCCL backend for efficiency and scalability. The paper mentions software but does not specify version numbers for reproducibility.
Experiment Setup	Yes	We used the AdamW optimizer with cosine annealing and linear warmup for both MuTMoT and REALM. Gradient clipping with a maximum norm of 1.0 was applied to stabilize training. For MuTMoT, the model was trained for 20 epochs with a batch size of 384 on each GPU, distributed across GPUs with gradient accumulation to ensure stable optimization. REALM was trained for up to 50 epochs using a 1000-step diffusion schedule during training and a reduced 50-step schedule during inference. Typical training time for MuTMoT is approximately 8 hours, while REALM training takes roughly 5 days to converge. The main training hyperparameters are summarized in Table 5.
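
The Experiment Setup row mentions cosine annealing with linear warmup but does not give the schedule's parameters here (they are in the paper's Table 5, which is not reproduced on this page). The following is a minimal sketch of one common form of such a schedule, not the authors' implementation. The function name lr_at_step, its parameters, and every numeric value in the usage note are hypothetical and chosen only for illustration.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate for a linear-warmup, cosine-annealing schedule.

    Hypothetical helper: the paper's actual warmup length, peak learning
    rate, and floor are not stated on this page.
    """
    if total_steps <= 0 or not 0 <= warmup_steps <= total_steps:
        raise ValueError("need total_steps > 0 and 0 <= warmup_steps <= total_steps")
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: decay from base_lr down to min_lr over the rest.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    for s in (0, 9, 10, 55, 100):
        print(s, lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3))
```

In a PyTorch training loop, a function like this would typically be wrapped as a multiplier for torch.optim.lr_scheduler.LambdaLR, with gradient clipping applied separately before each optimizer step; that wiring is likewise an assumption rather than something this page documents.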