Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models
Authors: Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Min Zhang, Hao Fei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input. To evaluate its effectiveness, we conduct both cross-domain and in-domain experiments. To explore its potential, we conduct scaling experiments with varying retrieval corpus sizes. Quantitative Results. Tables 1 and 2 present a quantitative comparison between VimoRAG and SoTA techniques. |
| Researcher Affiliation | Academia | 1 Harbin Institute of Technology (Shenzhen) 2 University of Macau 3 University of Surrey 4 Nanjing University 5 Nanyang Technological University 6 National University of Singapore |
| Pseudocode | No | The paper describes methods and training strategies in detail but does not contain explicit pseudocode blocks or algorithms formatted as such. |
| Open Source Code | Yes | All the resources (https://walkermitty.github.io/VimoRAG/) are available. |
| Open Datasets | Yes | We conduct extensive experiments on two widely used large-scale datasets following the existing works [16]. The first is the IDEA400 dataset, a high-quality whole-body motion dataset composed of 12.5K clips and 2.6M frames in Motion-X [11], which is utilized to assess OOD performance. The other dataset, HumanML3D [5], comprising 14,616 motion clips and 44,970 text descriptions, is utilized to evaluate in-domain performance. |
| Dataset Splits | Yes | HumanML3D [5] constitutes the largest dataset available, providing text descriptions alongside body-only motions. It includes a total of 14,616 motion clips and 44,970 text descriptions, with 5,371 unique words present across these descriptions. The dataset is partitioned into a training set (80%), a validation set (5%), and a test set (15%). |
| Hardware Specification | Yes | Inference is conducted using a single NVIDIA A800 GPU, while training is accelerated using 8 GPUs to enhance efficiency. |
| Software Dependencies | No | We implement VimoRAG with PyTorch. However, no specific version of PyTorch or any other software dependency is provided. |
| Experiment Setup | Yes | In Stage 1 of McDPO, we train 2 epochs with a learning rate 2e-4 for LoRA parameters (rank = 128, α = 256), with a learning rate 2e-5 for the visual adapter's parameters. In Stage 2 of McDPO, we train 1 epoch with a learning rate 2e-4. For the training configurations of McDPO, during Stage 1, we set the batch size to 64, weight decay to 0.0, and the maximum context length to 4096, employing a bf16 precision format. |
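
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `StageConfig`, its field names, and the constants `STAGE1` and `STAGE2` are hypothetical, and any value the quoted text does not report for a stage is left as `None` rather than guessed. It also assumes the unqualified Stage 2 learning rate plays the same role as the Stage 1 LoRA learning rate, which the quoted text does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one McDPO training stage.

    Field names are hypothetical; None marks a value that the quoted
    Experiment Setup text does not report for that stage.
    """
    epochs: int
    learning_rate: float                 # LoRA parameters in Stage 1
    adapter_lr: Optional[float] = None   # visual adapter learning rate
    lora_rank: Optional[int] = None
    lora_alpha: Optional[int] = None
    batch_size: Optional[int] = None
    weight_decay: Optional[float] = None
    max_context_length: Optional[int] = None
    precision: Optional[str] = None

    @property
    def lora_scale(self) -> Optional[float]:
        """Standard LoRA scaling factor alpha / rank, if both are known."""
        if self.lora_rank is None or self.lora_alpha is None:
            return None
        return self.lora_alpha / self.lora_rank


# Values below are copied from the Experiment Setup row.
STAGE1 = StageConfig(
    epochs=2,
    learning_rate=2e-4,
    adapter_lr=2e-5,
    lora_rank=128,
    lora_alpha=256,
    batch_size=64,
    weight_decay=0.0,
    max_context_length=4096,
    precision="bf16",
)

# Only the epoch count and learning rate are reported for Stage 2.
STAGE2 = StageConfig(epochs=1, learning_rate=2e-4)

if __name__ == "__main__":
    for name, cfg in (("stage1", STAGE1), ("stage2", STAGE2)):
        print(name, cfg, "lora_scale =", cfg.lora_scale)
```

With these values, `STAGE1.lora_scale` evaluates to 2.0 (256 / 128). Keeping the unreported Stage 2 fields as `None` makes the gaps in the reported setup explicit instead of silently inheriting Stage 1 settings.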