Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on eight Isaac Lab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
Researcher Affiliation	Academia	1 ShanghaiTech University; 2 University of Electronic Science and Technology of China; 3 Shanghai Jiao Tong University; 4 University College London; 5 MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration
Pseudocode	Yes	Algorithm 1 Generative Diffusion Policy Optimization
Open Source Code	Yes	The official implementation of GenPO is provided in https://github.com/wadx2019/genpo/.
Open Datasets	Yes	To demonstrate the superiority of GenPO, we conduct experiments on 8 Isaac Lab benchmarks [38], which cover robot ant, humanoid, quadcopter, Franka robot arm [23], Shadow dexterous hand [56], ANYbotics Anymal-D, and Unitree legged robots [68].
Dataset Splits	No	The paper uses Isaac Lab benchmarks, which are simulation environments for Reinforcement Learning. It does not provide explicit training/test/validation splits for a fixed dataset, as data is generated through interaction with the environment. It mentions 'All experiments are repeated across five random seeds' and 'Sample dummy actions ... to interact with the environment for N timesteps', which are characteristic of RL experimentation rather than static dataset splitting.
Hardware Specification	Yes	All experiments were carried out on a server equipped with two Intel Xeon Gold 6430 CPUs (32 cores per socket, 64 threads total per CPU, 128 threads total), with a base frequency of 2.1 GHz and a maximum turbo frequency of 3.4 GHz. The system supports 52-bit physical and 57-bit virtual addressing. For GPU acceleration, we used 8 NVIDIA GeForce RTX 4090 D GPUs, each with 24 GB of GDDR6X memory, connected via PCIe. The GPUs support CUDA 12.8 and were operating under the NVIDIA driver version 570.124.04.
Software Dependencies	No	The paper mentions 'CUDA 12.8' and 'NVIDIA driver version 570.124.04'. It also lists several reinforcement learning frameworks used for baselines and its own implementation (RSL-RL, RL-Games, SKRL, Stable-Baselines3), but it does not specify version numbers for these software libraries, which are key dependencies.
Experiment Setup	Yes	Table 2 and Table 10 summarize the hyperparameter configurations used across our experiments. For the baseline algorithms (PPO, SAC, TD3, and DDPG), we adopt the default hyperparameter settings provided by the SKRL library. For our proposed method, GenPO, we align its hyperparameter configuration with that of PPO to ensure a fair and controlled comparison.