Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Value Estimation Critically Enhances Vanilla Policy Gradient
Authors: Tao Wang, Ruipeng Zhang, Sicun Gao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, empirical results show that simply increasing the number of value updates enables the basic VPG algorithm to match PPO's performance across multiple continuous control benchmarks in Gymnasium. This highlights the pivotal role of value estimation in improving policy gradient methods. |
| Researcher Affiliation | Academia | 1University of California, San Diego, La Jolla, USA. Correspondence to: Tao Wang <EMAIL>. |
| Pseudocode | No | The paper describes algorithms and methods through mathematical equations and narrative text, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code for the empirical results is available at https://github.com/taowang0/value-estimation-vpg. |
| Open Datasets | Yes | We evaluate the role of value estimation across a range of continuous-control tasks from the OpenAI Gym benchmarks (Brockman et al., 2016). |
| Dataset Splits | No | The paper uses OpenAI Gym benchmarks, which are reinforcement learning environments. It mentions batch sizes and mini-batch sizes for training but does not define explicit train/test/validation splits for a static dataset in the traditional supervised learning sense. Thus, the information required by the question (percentages, sample counts, predefined splits) is not applicable or provided. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It only mentions general experimental setup details in Appendix A without hardware specifications. |
| Software Dependencies | No | The paper mentions adapting implementations from "Tianshou (Weng et al., 2022)" and "CleanRL (Huang et al., 2022b)" in Section 6. However, it does not provide specific version numbers for these libraries or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Appendix A, Experiment Hyperparameters. Table 3 (default hyperparameters, given as PPO / VPG): Num. env. 16 / 16; Discount factor (γ) 0.99 / 0.99; Num. epochs 10 / 1; Batch size 2048 / 2048; Minibatch size 64 / 2048; GAE factor (λ) 0.95 / 0.95; Optimizer Adam / Adam; Clipping parameter (ϵ) 0.2 / N/A; Advantage normalization False / False; Observation normalization True / True; Reward normalization False / False; Learning rate decay False / False; Entropy coefficient 0 / 0; Policy network [64, 64] / [64, 64]; Value network [64, 64] / [64, 64]; Activation function tanh / tanh; Gradient clipping (ℓ2 norm) 1.0 / 1.0. Table 4 lists the learning rate used for the policy and value network in each task; the learning rate 0.0003 is specifically chosen for Hopper to allow a better comparison with PPO on that task, although, as shown in Figure 7(a), the learning rate 0.0007 actually yields better performance on Hopper. |
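For quick reference, the default hyperparameters quoted above from Table 3 of the paper can be arranged as plain Python dictionaries. This is only a sketch of the reported values; the variable names here are illustrative, and the authors' repository may organize its configuration differently.

```python
# Defaults shared by PPO and VPG, as reported in Table 3 of the paper.
# Key names are illustrative, not taken from the authors' code.
SHARED = {
    "num_envs": 16,
    "gamma": 0.99,                      # discount factor
    "batch_size": 2048,
    "gae_lambda": 0.95,                 # GAE factor
    "optimizer": "Adam",
    "advantage_normalization": False,
    "observation_normalization": True,
    "reward_normalization": False,
    "learning_rate_decay": False,
    "entropy_coefficient": 0.0,
    "policy_network": [64, 64],
    "value_network": [64, 64],
    "activation": "tanh",
    "gradient_clip_l2_norm": 1.0,
}

# Algorithm-specific settings: PPO uses 10 epochs with 64-sample minibatches
# and clipping ε = 0.2; VPG uses a single epoch over the full batch and has
# no clipping parameter.
PPO = {**SHARED, "num_epochs": 10, "minibatch_size": 64, "clip_epsilon": 0.2}
VPG = {**SHARED, "num_epochs": 1, "minibatch_size": 2048}
```

Per-task learning rates (Table 4 of the paper) are not reproduced here, since only the Hopper values (0.0003 vs. 0.0007) are quoted above.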