Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on eight Isaac Lab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
Researcher Affiliation | Academia | (1) ShanghaiTech University; (2) University of Electronic Science and Technology of China; (3) Shanghai Jiao Tong University; (4) University College London; (5) MoE Key Laboratory of Intelligent Perception and Human Machine Collaboration
Pseudocode | Yes | Algorithm 1 Generative Diffusion Policy Optimization
Open Source Code | Yes | The official implementation of GenPO is provided in https://github.com/wadx2019/genpo/.
Open Datasets | Yes | To demonstrate the superiority of GenPO, we conduct experiments on 8 Isaac Lab benchmarks [38], which cover robot ant, humanoid, quadcopter, Franka robot arm [23], Shadow dexterous hand [56], ANYbotics anymal-D, and Unitree legged robots [68].
Dataset Splits | No | The paper uses Isaac Lab benchmarks, which are simulation environments for Reinforcement Learning. It does not provide explicit training/test/validation splits for a fixed dataset, as data is generated through interaction with the environment. It mentions 'All experiments are repeated across five random seeds' and 'Sample dummy actions ... to interact with the environment for N timesteps', which are characteristic of RL experimentation rather than static dataset splitting.
Hardware Specification | Yes | All experiments were carried out on a server equipped with two Intel Xeon Gold 6430 CPUs (32 cores per socket, 64 threads total per CPU, 128 threads total), with a base frequency of 2.1 GHz and a maximum turbo frequency of 3.4 GHz. The system supports 52-bit physical and 57-bit virtual addressing. For GPU acceleration, we used 8 NVIDIA GeForce RTX 4090 D GPUs, each with 24 GB of GDDR6X memory, connected via PCIe. The GPUs support CUDA 12.8 and were operating under the NVIDIA driver version 570.124.04.
Software Dependencies | No | The paper mentions 'CUDA 12.8' and 'NVIDIA driver version 570.124.04'. It also lists several reinforcement learning frameworks used for baselines and its own implementation (RSL-RL, RL-Games, SKRL, Stable-Baselines3), but it does not specify version numbers for these software libraries, which are key dependencies.
Experiment Setup | Yes | Tables 2 and Table 10 summarize the hyperparameter configurations used across our experiments. For the baseline algorithms: PPO, SAC, TD3, and DDPG, we adopt the default hyperparameter settings provided by the SKRL library. For our proposed method, GenPO, we align its hyperparameter configuration with that of PPO to ensure a fair and controlled comparison.
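
The Software Dependencies entry above is marked "No" because library version numbers are missing. As a minimal sketch of how such versions could be recorded from a working environment, the snippet below queries installed package metadata with Python's standard `importlib.metadata`. The distribution names in `PACKAGES` are assumptions about how the four frameworks are published on PyPI and are not taken from the paper or its repository; `collect_versions` and `format_pins` are hypothetical helper names introduced for illustration.

```python
from importlib import metadata

# Assumed PyPI distribution names for the frameworks named in the report.
# These are illustrative guesses, not names confirmed by the paper.
PACKAGES = ["rsl-rl-lib", "rl-games", "skrl", "stable-baselines3"]


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_pins(versions):
    """Render versions as requirements-style lines, sorted by package name."""
    lines = []
    for name, version in sorted(versions.items()):
        if version is None:
            lines.append(f"# {name}: not installed")
        else:
            lines.append(f"{name}=={version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_pins(collect_versions(PACKAGES)))
```

Run inside the environment used for the experiments, this prints one pinned line per installed package and a comment line for each package it cannot find, which is the kind of information the Software Dependencies criterion looks for.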