Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HMVLM: Human Motion-Vision-Language Model via MoE LoRA
Authors: Lei Hu, Yongjing Ye, Shihong Xia
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks. Experimental results show that the proposed HMVLM, built on the MoE LoRA framework, significantly reduces the model's forgetting rate while achieving strong performance in text-to-motion generation, monocular pose estimation, and motion video understanding. |
| Researcher Affiliation | Academia | ¹Institute of Computing Technology, Chinese Academy of Sciences; ²University of Chinese Academy of Sciences; EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations within Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks/figures. |
| Open Source Code | Yes | Answer: [Yes] Justification: The code will be released. |
| Open Datasets | Yes | Datasets. We train the gating network with the LMSYS-Chat-1M dataset [77], using 80% of the data for training. For the text-to-motion task, we use HumanML3D [21] and KIT-ML [49] datasets. Notably, the motion tokenizer is trained on the same training splits of HumanML3D and KIT-ML for consistency. For pose estimation and pose tokenizer training, we use the Human3.6M [29] and 3DPW [62] datasets. The MoVid dataset [5] is used for instruction tuning in motion video understanding. |
| Dataset Splits | Yes | We train the gating network with the LMSYS-Chat-1M dataset [77], using 80% of the data for training. For the text-to-motion task, we use HumanML3D [21] and KIT-ML [49] datasets. Notably, the motion tokenizer is trained on the same training splits of HumanML3D and KIT-ML for consistency. For pose estimation and pose tokenizer training, we use the Human3.6M [29] and 3DPW [62] datasets. The MoVid dataset [5] is used for instruction tuning in motion video understanding. We conducted qualitative comparisons on the 3DPW dataset, following its official training and test split. |
| Hardware Specification | Yes | We conduct experiment on a single NVIDIA A800 80G GPU. |
| Software Dependencies | No | The paper mentions specific models and optimizers but does not provide explicit version numbers for general software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | We use Vicuna-7b-v1.5 [7] as the foundation language model with five LoRA experts (including a zero expert), each of rank 8. LoRA adapters are applied to all linear modules, and the gating network is implemented as a two-layer MLP with a hidden dimension of 512. It takes the 512-dimensional text features output by the CLIP model and predicts the weights for five experts. For detailed implementation, please refer to the Appendix. We conduct experiment on a single NVIDIA A800 80G GPU. In the pose and motion tokenizer, the number of body parts is set to 5, corresponding to the torso and four limbs with each part having an embedding dimension of S = 512. The codebook size for each body part is fixed at K = 512 and temporal compression ratio is set to l = 4 in motion tokenization. During training, the commitment loss coefficient λ_com is set to 0.02. We use the AdamW optimizer with hyperparameters [β1, β2] = [0.9, 0.99] and a learning rate of 2 × 10⁻⁴. During the instruction tuning stage, simultaneous fine-tuning on three human motion-related tasks took a total of 120 hours. We used a batch size of 32 and a micro batch size of 2. AdamW is also used for optimization in this stage, with an initial learning rate of 3 × 10⁻³, which is scheduled using Cosine Annealing. |
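
The instruction-tuning settings quoted in the Experiment Setup row (batch size 32, micro batch size 2, initial learning rate 3 × 10⁻³ with cosine annealing) can be sketched as follows. This is a minimal illustration and not the authors' code: reading the two batch sizes as gradient accumulation on a single GPU is an assumption, the schedule uses the standard cosine annealing formula, and `TOTAL_STEPS` and `MIN_LR` are hypothetical values that the quoted text does not state.

```python
import math

# Values quoted in the Experiment Setup row.
BATCH_SIZE = 32
MICRO_BATCH_SIZE = 2
INITIAL_LR = 3e-3

# Hypothetical values for illustration only; not stated in the source.
TOTAL_STEPS = 1000
MIN_LR = 0.0


def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Micro-batches accumulated per optimizer step (assumed interpretation)."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of the micro batch size")
    return batch_size // micro_batch_size


def cosine_annealing_lr(step: int, total_steps: int,
                        initial_lr: float, min_lr: float = 0.0) -> float:
    """Standard cosine annealing from initial_lr down to min_lr."""
    # Clamp the step so the schedule stays flat outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("accumulation steps:", accumulation_steps(BATCH_SIZE, MICRO_BATCH_SIZE))
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        lr = cosine_annealing_lr(step, TOTAL_STEPS, INITIAL_LR, MIN_LR)
        print(f"step {step}: lr = {lr:.6f}")
```

Under these assumptions, each optimizer step accumulates 16 micro-batches, and the learning rate falls from 3 × 10⁻³ at the start to half that value at the midpoint of the schedule.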