Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Authors: Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Min Zhang, Hao Fei

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input. To evaluate its effectiveness, we conduct both cross-domain and in-domain experiments. To explore its potential, we conduct scaling experiments with varying retrieval corpus sizes. Quantitative Results. Tables 1 and 2 present a quantitative comparison between VimoRAG and SoTA techniques. |
| Researcher Affiliation | Academia | 1 Harbin Institute of Technology (Shenzhen); 2 University of Macau; 3 University of Surrey; 4 Nanjing University; 5 Nanyang Technological University; 6 National University of Singapore |
| Pseudocode | No | The paper describes methods and training strategies in detail but does not contain explicit pseudocode blocks or algorithms formatted as such. |
| Open Source Code | Yes | All the resources (https://walkermitty.github.io/VimoRAG/) are available. |
| Open Datasets | Yes | We conduct extensive experiments on two widely used large-scale datasets following the existing works [16]. The first is the IDEA400 dataset, a high-quality whole-body motion dataset composed of 12.5K clips and 2.6M frames in Motion-X [11], which is utilized to assess OOD performance. The other dataset, HumanML3D [5], comprising 14,616 motion clips and 44,970 text descriptions, is utilized to evaluate in-domain performance. |
| Dataset Splits | Yes | HumanML3D [5] constitutes the largest dataset available, providing text descriptions alongside body-only motions. It includes a total of 14,616 motion clips and 44,970 text descriptions, with 5,371 unique words present across these descriptions. The dataset is partitioned into a training set (80%), a validation set (5%), and a test set (15%). |
| Hardware Specification | Yes | Inference is conducted using a single NVIDIA A800 GPU, while training is accelerated using 8 GPUs to enhance efficiency. |
| Software Dependencies | No | We implement VimoRAG with PyTorch. However, no specific version of PyTorch or any other software dependency is provided. |
| Experiment Setup | Yes | In Stage 1 of McDPO, we train 2 epochs with a learning rate 2e-4 for LoRA parameters (rank = 128, α = 256), with a learning rate 2e-5 for the visual adapter's parameters. In Stage 2 of McDPO, we train 1 epoch with a learning rate 2e-4. For the training configurations of McDPO, during Stage 1, we set the batch size to 64, weight decay to 0.0, and the maximum context length to 4096, employing a bf16 precision format. |
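
The settings quoted in the Experiment Setup entry can be gathered into a structured record, which makes it easier to see what is and is not reported for each stage. The following is a minimal sketch, not the authors' configuration code: the class name StageConfig, its field names, and the helper functions are hypothetical, and only the numeric values come from the quoted text. Settings the quote does not give for Stage 2 are left as None rather than guessed. The scaling factor alpha / rank assumes the standard LoRA formulation.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical record of one McDPO training stage (field names are illustrative)."""
    epochs: int
    learning_rate: float                           # Stage 1: reported for the LoRA parameters
    adapter_learning_rate: Optional[float] = None  # visual adapter; None means not reported
    batch_size: Optional[int] = None
    weight_decay: Optional[float] = None
    max_context_length: Optional[int] = None
    precision: Optional[str] = None


# LoRA hyperparameters as quoted (rank = 128, alpha = 256).
LORA_RANK = 128
LORA_ALPHA = 256

# Stage 1: every value below appears in the quoted setup.
STAGE_1 = StageConfig(
    epochs=2,
    learning_rate=2e-4,
    adapter_learning_rate=2e-5,
    batch_size=64,
    weight_decay=0.0,
    max_context_length=4096,
    precision="bf16",
)

# Stage 2: the quote gives only the epoch count and a learning rate.
STAGE_2 = StageConfig(epochs=1, learning_rate=2e-4)


def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA (alpha / rank)."""
    return alpha / rank


def unreported_fields(cfg: StageConfig) -> List[str]:
    """Names of settings left as None, i.e. not stated in the quoted text."""
    return sorted(name for name, value in asdict(cfg).items() if value is None)


if __name__ == "__main__":
    print("LoRA scaling:", lora_scaling(LORA_ALPHA, LORA_RANK))
    print("Stage 2 settings not reported:", unreported_fields(STAGE_2))
```

Listing the unreported Stage 2 fields explicitly is a deliberate choice: anyone attempting a reproduction would need to confirm those values against the paper or the released resources rather than assume they match Stage 1.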