Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Online Reinforcement Learning for Diffusion Policy
Authors: Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RLs on most tasks, and the DPMD improves more than 120% over soft actor-critic on Humanoid and Ant. We conduct extensive empirical evaluation on MuJoCo, showing that the proposed algorithms outperform recent diffusion-based online RL baselines in most tasks. |
| Researcher Affiliation | Academia | ¹Harvard University. ²Georgia Institute of Technology. Emails: Haitong Ma <EMAIL>, Na Li <EMAIL>, Bo Dai <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Diffusion Policy Mirror Descent (DPMD) ... Algorithm 2 Soft Diffusion Actor-Critic (SDAC) |
| Open Source Code | Yes | The implementation can be found at https://github.com/mahaitongdae/diffusion_policy_online_rl. |
| Open Datasets | Yes | evaluated the performance on 10 OpenAI Gym MuJoCo v4 tasks. |
| Dataset Splits | Yes | All environments except Humanoid-v4 are trained over 200K iterations with a total of 1 million environment interactions, while Humanoid-v4 has five times more. The results are evaluated with the average return of 20 episodes across 5 random seeds. |
| Hardware Specification | Yes | The computation is conducted on a desktop workstation with AMD Ryzen 9 7950X CPU, 96 GB memory, and NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | We implemented the proposed DPMD and SDAC algorithms with the JAX package. |
| Experiment Setup | Yes | Table 4. Hyperparameters (name: value). Critic learning rate: 3e-4; Policy learning rate: 3e-4, linear annealing to 3e-5; Diffusion steps: 20; Diffusion noise schedules: Cosine; Policy network hidden layers: 3; Policy network hidden neurons: 256; Policy network activation: Mish; Value network hidden layers: 3; Value network hidden neurons: 256; Value network activation: Mish; Replay buffer size (off-policy only): 1 million. |
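
For readers who want to reuse the reported setup, the sketch below collects the Table 4 values in a small Python config and illustrates the two procedures the table rows mention: linear annealing of the policy learning rate and averaging evaluation returns over seeds. It is not taken from the authors' repository. The class name, field names, and helper functions are hypothetical, and the annealing horizon (the full training run) is an assumption, since the table does not state it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hyperparameters:
    """Values copied from Table 4; the field names are illustrative only."""
    critic_lr: float = 3e-4
    policy_lr_start: float = 3e-4
    policy_lr_end: float = 3e-5
    diffusion_steps: int = 20
    noise_schedule: str = "cosine"
    policy_hidden_layers: int = 3
    policy_hidden_units: int = 256
    policy_activation: str = "mish"
    value_hidden_layers: int = 3
    value_hidden_units: int = 256
    value_activation: str = "mish"
    replay_buffer_size: int = 1_000_000  # off-policy only


def policy_lr(step: int, total_steps: int, hp: Hyperparameters) -> float:
    """Linearly interpolate the policy learning rate from start to end.

    Assumption: annealing spans the whole run of `total_steps` iterations.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return hp.policy_lr_start + frac * (hp.policy_lr_end - hp.policy_lr_start)


def mean_return(returns_per_seed: Sequence[Sequence[float]]) -> float:
    """Average episode returns within each seed, then average across seeds."""
    per_seed = [sum(r) / len(r) for r in returns_per_seed]
    return sum(per_seed) / len(per_seed)


hp = Hyperparameters()
halfway_lr = policy_lr(100_000, 200_000, hp)  # midpoint of a 200K-iteration run
```

With the reported protocol, `mean_return` would receive 5 lists (one per random seed) of 20 episode returns each, and `total_steps` would be 200K for every environment except Humanoid-v4.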