Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Value Diffusion Reinforcement Learning
Authors: Xiaoliang Hu, Fuyun Wang, Tong Zhang, Zhen Cui
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the MuJoCo benchmark demonstrate that VDRL significantly outperforms some SOTA model-free online RL baselines, showcasing its effectiveness and robustness. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Engineering, Nanjing University of Science and Technology; 2 School of Artificial Intelligence, Beijing Normal University |
| Pseudocode | Yes | Algorithm 1 Value Diffusion Reinforcement Learning |
| Open Source Code | No | The proposed algorithm of the training process is presented in Section 3.3. The intricate implementation details of and training hyperparameters are deferred to Section 5.2 and Appendix C. |
| Open Datasets | Yes | We use eight different level tasks of the popular MuJoCo [33] benchmark to evaluate the performance of all methods, including Ant-v3, HalfCheetah-v3, Hopper-v3, Humanoid-v3, Inverted2Pendulum-v2, Pusher-v2, Swimmer-v3, and Walker2d-v3. The more details of these tasks and the MuJoCo benchmark are deferred to Appendix B. [33] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012. |
| Dataset Splits | Yes | For the proposed method and all baselines, all experiments are conducted over 800,000 training interaction steps with four runs and different random seeds (0, 50, 100, 200), with each performing one evaluation rollout every 10,000 interaction steps. In particular, each evaluation result is the average of ten episodes. |
| Hardware Specification | Yes | Note that all experiments are conducted on a 2.90GHz Intel Core i7-10700 CPU, 64G RAM, and NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The implementation of other baselines and our method is based on PyTorch. However, specific version numbers for PyTorch or other libraries used for VDRL are not mentioned. |
| Experiment Setup | Yes | The setting of all hyperparameters for the training is presented in Table 3. Table 3: Hyperparameters for training (name: value, description). Optim: Adam, the optimizer of method; n_iteration: 8 × 10^5, maximum iteration steps until the end of training; buffer_size: 1 × 10^6, capacity of replay buffer; batch_size: 256, number of samples from each update; evaluate_cycle: 10000, how often to evaluate the model; No. of hidden layers: 2, the number of hidden layers; No. of hidden nodes: 256, the number of hidden nodes; Activation_π: GELU, the activation of policy; Activation_Zi: Mish, the activation of value distributions; lr_ϕ: 3 × 10^-4, learning rate for policy; lr_θi: 3 × 10^-4, learning rate for value distributions; τ: 0.005, the target momentum coefficient; γ: 0.99, discount factor; T: 10, diffusion step. |
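
For readers who want to carry the reported setup into their own experiments, the sketch below collects the Table 3 hyperparameters and the evaluation protocol from the Dataset Splits row into a plain Python configuration. It is not the authors' code, which is not released: the class name `VDRLConfig`, the field names, and the helper `evaluation_steps` are hypothetical. Placing the first evaluation at step 10,000 rather than step 0 is also an assumption, since the quoted text does not say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VDRLConfig:
    """Hyperparameters as quoted from Table 3 (field names are illustrative)."""
    optimizer: str = "Adam"
    n_iteration: int = 800_000       # 8 x 10^5 training interaction steps
    buffer_size: int = 1_000_000     # 1 x 10^6 replay buffer capacity
    batch_size: int = 256
    evaluate_cycle: int = 10_000     # interaction steps between evaluations
    n_hidden_layers: int = 2
    n_hidden_nodes: int = 256
    activation_policy: str = "GELU"
    activation_value: str = "Mish"
    lr_policy: float = 3e-4
    lr_value: float = 3e-4
    tau: float = 0.005               # target momentum coefficient
    gamma: float = 0.99              # discount factor
    diffusion_steps: int = 10        # T


# Evaluation protocol quoted in the Dataset Splits row.
SEEDS = (0, 50, 100, 200)
EPISODES_PER_EVAL = 10


def evaluation_steps(cfg: VDRLConfig) -> list[int]:
    """Interaction steps at which an evaluation rollout would run.

    Assumption: the first evaluation happens after one full cycle,
    not at step 0.
    """
    return list(range(cfg.evaluate_cycle, cfg.n_iteration + 1, cfg.evaluate_cycle))


if __name__ == "__main__":
    cfg = VDRLConfig()
    steps = evaluation_steps(cfg)
    total_episodes = len(steps) * EPISODES_PER_EVAL * len(SEEDS)
    print(f"{len(steps)} evaluations per run, {total_episodes} evaluation episodes per task")
```

Under these assumptions each run has 80 evaluation points, so the four seeds together give 3,200 evaluation episodes per task and method.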