Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization

Authors: Xiaofeng Tan, Hongsong Wang, Xin Geng, Pan Zhou

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist of 3.25% (vs e.g. 0.76% of MoDiPO) on the MLD model, 2.91% (vs e.g. 0.66% of MoDiPO) on MDM model, respectively. Additionally, the MLD model fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM-Dist. Visualization results also show the efficacy of our SoPo in preference alignment.
Researcher Affiliation	Academia	Xiaofeng Tan (1,2), Hongsong Wang (1,2), Xin Geng (1,2), Pan Zhou (3). (1) Department of Computer Science and Engineering, Southeast University, Nanjing, China; (2) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China; (3) Singapore Management University
Pseudocode	Yes	Algorithm 1 SoPo for text-to-motion generation
Open Source Code	No	We plan to release the code and detailed documentation after the acceptance of the paper.
Open Datasets	Yes	For text-to-motion generation, we evaluate SoPo on two widely used datasets, HumanML3D [3] and KIT-ML [36], focusing on two key aspects: alignment and generation quality. ... For text-to-image generation, we utilize Flux-Dev [37] as the foundational model and employ HPSv2 [38] as the reward model. To construct the offline training pairs, we first sample data from the HPDv2 dataset.
Dataset Splits	Yes	HumanML3D is derived from the AMASS [50] and HumanAct12 [51] datasets and contains 14,616 motions, each described by three textual annotations. All motion is split into train, test, and evaluate sets, composed of 23384, 1460, and 4380 motions, respectively. For both HumanML3D and KIT-ML datasets, we follow the official split and report the evaluated performance on the test set.
Hardware Specification	Yes	All models are trained in 100 minutes on a single NVIDIA GeForce RTX 4090D GPU. ... The text-to-image model was trained for 330 GPU hours across 8 NVIDIA GPUs using LoRA, configured with a rank of r = 128 and a scaling factor α = 256.
Software Dependencies	No	The paper mentions specific optimizers and models (e.g., AdamW optimizer, MLD, MDM) but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We use a batch size of 64, with a guidance parameter of 2.5 during testing. Diffusion employs a cosine noise schedule with 50 steps, and an evaluation batch size of 32 ensures consistent metric computation. For fine-tuning MLD [1], we similarly follow its original parameter settings. ... Hyperparameters K and τ are tuned through preliminary experiments to balance performance and efficiency, with τ = 0.45, C = 2, and β = 1 in Eq. (14). We set K = 4 for MDM [40] and K = 2 for MLD [1].
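
For readers attempting a rerun, the values quoted in the Experiment Setup row can be collected into a small configuration, as in the sketch below. This is not the authors' code (the Open Source Code row notes that none has been released): the key names and the helper cosine_alpha_bar are hypothetical, and the cosine schedule is assumed to take the commonly used squared-cosine form with offset s = 0.008, which the row does not specify. Only the numeric values in REPORTED_SETUP and K_PER_BACKBONE come from the row itself.

```python
import math

# Values quoted in the Experiment Setup row. Key names are hypothetical.
REPORTED_SETUP = {
    "batch_size": 64,
    "eval_batch_size": 32,
    "guidance_scale": 2.5,      # applied during testing
    "diffusion_steps": 50,
    "noise_schedule": "cosine",
    "tau": 0.45,
    "C": 2,
    "beta": 1.0,                # beta in Eq. (14) of the paper
}

# K is reported per backbone. MLD otherwise follows its original settings,
# which the row does not list, so they are not reproduced here.
K_PER_BACKBONE = {"MDM": 4, "MLD": 2}


def cosine_alpha_bar(t: int, total_steps: int, s: float = 0.008) -> float:
    """Cumulative signal level at step t under an ASSUMED squared-cosine schedule.

    The exact schedule variant and the offset s are assumptions, not taken
    from the paper.
    """
    def f(u: float) -> float:
        return math.cos((u / total_steps + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


if __name__ == "__main__":
    steps = REPORTED_SETUP["diffusion_steps"]
    for t in (0, steps // 2, steps):
        print(t, cosine_alpha_bar(t, steps))
```

The missing entries are as informative as the present ones: learning rate, library versions, and the original MLD settings are absent from the row and would need to come from the paper or from the upstream MLD and MDM repositories.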