Improving Deep Policy Gradients with Value Function Search

Authors: Enrico Marchesini, Christopher Amato

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of VFS on different Deep PG baselines: (i) Proximal Policy Optimization (PPO) (Schulman et al., 2017), (ii) Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and (iii) TD3 on a range of continuous control benchmark tasks (Brockman et al., 2016; Todorov et al., 2012). Our evaluation confirms that VFS leads to better gradient estimates with a lower variance that significantly improve sample efficiency and lead to policies with higher returns.
Researcher Affiliation | Academia | Enrico Marchesini, Christopher Amato; Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA; {e.marchesini, c.amato}@northeastern.edu
Pseudocode | Yes | Algorithm 1: Value Function Search (an illustrative sketch of such a search step follows the table).
Open Source Code | No | No explicit statement about making the source code available, or a link to a repository, was found.
Open Datasets | Yes | We conduct our experiments on five continuous control tasks based on MuJoCo (Brockman et al., 2016) (in their v4 version), which are widely used for comparing Deep PG, ensembles, and policy search approaches (Schulman et al., 2017; Lillicrap et al., 2016; Lee et al., 2021; Fujimoto et al., 2018). An environment-setup example follows the table.
Dataset Splits | No | The paper uses continuous MuJoCo environments in which data is generated through interaction rather than drawn from pre-split datasets. It describes how data is sampled for training and evaluation during the RL process but does not specify a fixed training/validation/test split.
Hardware Specification | Yes | Data are collected on nodes equipped with Xeon E5-2650 CPUs and 64GB of RAM, using the hyperparameters of Appendix E.
Software Dependencies | No | The paper mentions software components such as PPO, DDPG, TD3, and MuJoCo, but does not give version numbers for the underlying libraries (e.g., PyTorch 1.x or TensorFlow 2.x).
Experiment Setup | Yes | Table 2 (hyperparameters for the experiments), reconstructed below; a config sketch with the same values follows the table.
  PPO: ϵ clip 0.2; γ 0.99; update epochs 10; samples per update 2048; mini-batch size 64; entropy coefficient 0.001; lr 0.0003/0.0001
  DDPG: γ 0.99; buffer size 1,000,000; mini-batch size 128; actor lr 0.0001; critic lr 0.001/0.0003; τ 0.005
  TD3: actor noise clip 0.5; delayed actor update 2
  VFS: population size 10; σmin 0.000005; σmax 0.0005; b 2048; es 2048 steps
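
For the Pseudocode row: the paper's Algorithm 1 is not reproduced in this report, so the following is only a minimal sketch of a perturbation-based search over critic parameters of the kind the row refers to, assuming a PyTorch critic and a caller-supplied value_loss_fn (e.g., a TD-error loss on a sampled batch). The function names, signature, and acceptance rule are illustrative assumptions, not the authors' implementation; only the population size and σ range are taken from Table 2.

    import copy
    import torch


    def value_function_search(critic, batch, value_loss_fn,
                              population_size=10, sigma_min=5e-6, sigma_max=5e-4):
        """Illustrative perturbation-based search over critic parameters.

        `critic` is a torch.nn.Module; `value_loss_fn(critic, batch)` is a
        hypothetical caller-supplied callable returning a scalar value-function
        loss tensor (e.g., a TD-error loss on `batch`).
        """
        best_critic = critic
        best_loss = value_loss_fn(critic, batch).item()

        for _ in range(population_size):
            candidate = copy.deepcopy(critic)
            # Sample a perturbation scale in [sigma_min, sigma_max] and add
            # Gaussian noise to every parameter of the candidate critic.
            sigma = float(torch.empty(1).uniform_(sigma_min, sigma_max))
            with torch.no_grad():
                for p in candidate.parameters():
                    p.add_(sigma * torch.randn_like(p))
            loss = value_loss_fn(candidate, batch).item()
            if loss < best_loss:
                best_critic, best_loss = candidate, loss

        # Keep the lowest-loss value function from the population (or the original).
        return best_critic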
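For the Open Datasets row: the MuJoCo v4 tasks can be instantiated through the standard Gym interface. The snippet below assumes the Gymnasium fork of OpenAI Gym with the MuJoCo extras installed, and the task name is illustrative only, since the excerpt above does not list the paper's five tasks.

    import gymnasium as gym  # assumption: Gymnasium fork of OpenAI Gym

    # Task name is illustrative; the excerpt does not list the paper's five tasks.
    env = gym.make("HalfCheetah-v4")
    obs, info = env.reset(seed=0)
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    env.close()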
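For the Experiment Setup row: the Table 2 values are transcribed into a plain Python mapping for readability. The numbers come from the table; the dict layout and key names are assumptions of this sketch, and the roles of "b" and "es" are left as named in the paper.

    # Values transcribed from Table 2; dict layout and key names are assumptions.
    HYPERPARAMS = {
        "PPO": {
            "clip_epsilon": 0.2,
            "gamma": 0.99,
            "update_epochs": 10,
            "samples_per_update": 2048,
            "minibatch_size": 64,
            "entropy_coefficient": 0.001,
            "lr": (0.0003, 0.0001),        # both values listed in Table 2
        },
        "DDPG": {
            "gamma": 0.99,
            "buffer_size": 1_000_000,
            "minibatch_size": 128,
            "actor_lr": 0.0001,
            "critic_lr": (0.001, 0.0003),  # both values listed in Table 2
            "tau": 0.005,
        },
        "TD3": {
            "actor_noise_clip": 0.5,
            "delayed_actor_update": 2,
        },
        "VFS": {
            "population_size": 10,
            "sigma_min": 0.000005,
            "sigma_max": 0.0005,
            "b": 2048,                     # "b" as named in Table 2
            "es": 2048,                    # "es 2048 steps" in Table 2
        },
    }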