Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization

Authors: Xiaofeng Tan, Hongsong Wang, Xin Geng, Pan Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist of 3.25% (vs e.g. 0.76% of MoDiPO) on the MLD model, 2.91% (vs e.g. 0.66% of MoDiPO) on MDM model, respectively. Additionally, the MLD model fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM-Dist. Visualization results also show the efficacy of our SoPo in preference alignment.
Researcher Affiliation | Academia | Xiaofeng Tan (1,2), Hongsong Wang (1,2), Xin Geng (1,2), Pan Zhou (3). (1) Department of Computer Science and Engineering, Southeast University, Nanjing, China; (2) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China; (3) Singapore Management University.
Pseudocode | Yes | Algorithm 1 SoPo for text-to-motion generation
Open Source Code | No | We plan to release the code and detailed documentation after the acceptance of the paper.
Open Datasets | Yes | For text-to-motion generation, we evaluate SoPo on two widely used datasets, HumanML3D [3] and KIT-ML [36], focusing on two key aspects: alignment and generation quality. ... For text-to-image generation, we utilize Flux-Dev [37] as the foundational model and employ HPSv2 [38] as the reward model. To construct the offline training pairs, we first sample data from the HPDv2 dataset.
Dataset Splits | Yes | HumanML3D is derived from the AMASS [50] and HumanAct12 [51] datasets and contains 14,616 motions, each described by three textual annotations. All motion is split into train, test, and evaluate sets, composed of 23384, 1460, and 4380 motions, respectively. For both HumanML3D and KIT-ML datasets, we follow the official split and report the evaluated performance on the test set.
Hardware Specification | Yes | All models are trained in 100 minutes on a single NVIDIA GeForce RTX 4090D GPU. ... The text-to-image model was trained for 330 GPU hours across 8 NVIDIA GPUs using LoRA, configured with a rank of r = 128 and a scaling factor α = 256.
Software Dependencies | No | The paper mentions specific optimizers and models (e.g., AdamW optimizer, MLD, MDM) but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a batch size of 64, with a guidance parameter of 2.5 during testing. Diffusion employs a cosine noise schedule with 50 steps, and an evaluation batch size of 32 ensures consistent metric computation. For fine-tuning MLD [1], we similarly follow its original parameter settings. ... Hyperparameters K and τ are tuned through preliminary experiments to balance performance and efficiency, with τ = 0.45, C = 2, and β = 1 in Eq. (14). We set K = 4 for MDM [40] and K = 2 for MLD [1].
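
For readers attempting a reproduction, the settings quoted in the Experiment Setup row can be gathered into one configuration object. The sketch below is illustrative only: the class name, field names, and the `cosine_betas` helper are hypothetical and not taken from the paper or its unreleased code. The cosine schedule shown is the commonly used Nichol and Dhariwal variant with offset s = 0.008, which is an assumption, since the quoted text only says "cosine noise schedule with 50 steps".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    train_batch_size: int = 64
    eval_batch_size: int = 32
    guidance_scale: float = 2.5   # guidance parameter used during testing
    diffusion_steps: int = 50
    tau: float = 0.45             # threshold tau, as quoted
    C: int = 2                    # constant C, as quoted
    beta: float = 1.0             # beta in Eq. (14) of the paper
    k_by_model: tuple = (("MDM", 4), ("MLD", 2))

    def k_for(self, model: str) -> int:
        """Return the reported K for a given base model name."""
        return dict(self.k_by_model)[model]


def cosine_betas(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> list:
    """Per-step betas for a cosine schedule (assumed variant, not confirmed by the paper)."""
    def alpha_bar(t: int) -> float:
        # Cumulative signal level at step t, following a squared-cosine curve.
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    # beta_i = 1 - alpha_bar(i + 1) / alpha_bar(i), clipped to avoid a singular last step.
    return [min(1 - alpha_bar(i + 1) / alpha_bar(i), max_beta) for i in range(num_steps)]


setup = ReportedSetup()
betas = cosine_betas(setup.diffusion_steps)
print(setup.k_for("MDM"), setup.k_for("MLD"), len(betas))
```

Keeping the reported values in a frozen dataclass makes it harder to change them by accident between runs. Settings the row does not state (learning rate, library versions, random seeds) are deliberately left out rather than guessed.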