Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Value Estimation Critically Enhances Vanilla Policy Gradient
Authors: Tao Wang, Ruipeng Zhang, Sicun Gao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, empirical results show that simply increasing the number of value updates enables the basic VPG algorithm to match PPO's performance across multiple continuous control benchmarks in Gymnasium. This highlights the pivotal role of value estimation in improving policy gradient methods. |
| Researcher Affiliation | Academia | 1University of California, San Diego, La Jolla, USA. Correspondence to: Tao Wang <EMAIL>. |
| Pseudocode | No | The paper describes algorithms and methods through mathematical equations and narrative text, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code for the empirical results is available at https://github.com/taowang0/value-estimation-vpg. |
| Open Datasets | Yes | We evaluate the role of value estimation across a range of continuous-control tasks from the OpenAI Gym benchmarks (Brockman et al., 2016). |
| Dataset Splits | No | The paper uses OpenAI Gym benchmarks, which are reinforcement learning environments. It mentions batch sizes and mini-batch sizes for training but does not define explicit train/test/validation splits for a static dataset in the traditional supervised learning sense. Thus, the information required by the question (percentages, sample counts, predefined splits) is not applicable or provided. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It only mentions general experimental setup details in Appendix A without hardware specifications. |
| Software Dependencies | No | The paper mentions adapting implementations from "Tianshou (Weng et al., 2022)" and "CleanRL (Huang et al., 2022b)" in Section 6. However, it does not provide specific version numbers for these libraries or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Appendix A, Experiment Hyperparameters. Table 3 (default hyperparameters, given as PPO / VPG): Num. env. 16 / 16; Discount factor (γ) 0.99 / 0.99; Num. epochs 10 / 1; Batch size 2048 / 2048; Minibatch size 64 / 2048; GAE factor (λ) 0.95 / 0.95; Optimizer Adam / Adam; Clipping parameter (ϵ) 0.2 / N/A; Advantage normalization False / False; Observation normalization True / True; Reward normalization False / False; Learning rate decay False / False; Entropy coefficient 0 / 0; Policy network [64, 64] / [64, 64]; Value network [64, 64] / [64, 64]; Activation function tanh / tanh; Gradient clipping (ℓ2 norm) 1.0 / 1.0. Table 4 lists the learning rate used for the policy and value network in each task; the learning rate 0.0003 is specifically chosen for Hopper to allow a better comparison with PPO on that task, although, as shown in Figure 7(a), the learning rate 0.0007 actually yields better performance on Hopper. |
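For quick reference, the default hyperparameters quoted above from Table 3 of the paper can be arranged as plain Python dictionaries. This is only a sketch of the reported values; the variable names here are illustrative, and the authors' repository may organize its configuration differently.

```python
# Defaults shared by PPO and VPG, as reported in Table 3 of the paper.
# Key names are illustrative, not taken from the authors' code.
SHARED = {
    "num_envs": 16,
    "gamma": 0.99,                      # discount factor
    "batch_size": 2048,
    "gae_lambda": 0.95,                 # GAE factor
    "optimizer": "Adam",
    "advantage_normalization": False,
    "observation_normalization": True,
    "reward_normalization": False,
    "learning_rate_decay": False,
    "entropy_coefficient": 0.0,
    "policy_network": [64, 64],
    "value_network": [64, 64],
    "activation": "tanh",
    "gradient_clip_l2_norm": 1.0,
}

# Algorithm-specific settings: PPO uses 10 epochs with 64-sample minibatches
# and clipping ε = 0.2; VPG uses a single epoch over the full batch and has
# no clipping parameter.
PPO = {**SHARED, "num_epochs": 10, "minibatch_size": 64, "clip_epsilon": 0.2}
VPG = {**SHARED, "num_epochs": 1, "minibatch_size": 2048}
```

Per-task learning rates (Table 4 of the paper) are not reproduced here, since only the Hopper values (0.0003 vs. 0.0007) are quoted above.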