Improving Deep Policy Gradients with Value Function Search
Authors: Enrico Marchesini, Christopher Amato
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of VFS on different Deep PG baselines: (i) Proximal Policy Optimization (PPO) (Schulman et al., 2017), (ii) Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), and (iii) TD3 on a range of continuous control benchmark tasks (Brockman et al., 2016; Todorov et al., 2012). Our evaluation confirms that VFS leads to better gradient estimates with a lower variance that significantly improve sample efficiency and lead to policies with higher returns. |
| Researcher Affiliation | Academia | Enrico Marchesini, Christopher Amato Khoury College of Computer Sciences Northeastern University Boston, MA, USA {e.marchesini, c.amato}@northeastern.edu |
| Pseudocode | Yes | Algorithm 1 Value Function Search |
| Open Source Code | No | No explicit statement about making the source code available or a link to a repository was found. |
| Open Datasets | Yes | We conduct our experiments on five continuous control tasks based on MuJoCo (Brockman et al., 2016) (in their v4 version), which are widely used for comparing Deep PG, ensembles, and policy search approaches (Schulman et al., 2017; Lillicrap et al., 2016; Lee et al., 2021; Fujimoto et al., 2018). |
| Dataset Splits | No | The paper uses continuous environments (MuJoCo) where data is generated through interaction, not pre-split datasets. It describes how data is sampled for training and evaluation during the RL process but does not specify a fixed training/validation/test split for a dataset. |
| Hardware Specification | Yes | Data are collected on nodes equipped with Xeon E5-2650 CPUs and 64GB of RAM, using the hyperparameters of Appendix E. |
| Software Dependencies | No | The paper names the algorithms and simulator it builds on (PPO, DDPG, TD3, MuJoCo), but does not specify versions for its software stack (e.g., PyTorch 1.x or TensorFlow 2.x). |
| Experiment Setup | Yes | Table 2: Hyperparameters for our experiments. PPO: ϵ clip 0.2, γ 0.99, update epochs 10, samples per update 2048, mini-batch size 64, entropy coefficient 0.001, lr 0.0003/0.0001. DDPG: γ 0.99, buffer size 1000000, mini-batch size 128, actor lr 0.0001, critic lr 0.001/0.0003, τ 0.005. TD3: actor noise clip 0.5, delayed actor update 2. VFS: population size 10, σmin 0.000005, σmax 0.0005, b 2048, es 2048 steps. (Hedged code sketches illustrating Algorithm 1 and these settings follow the table.) |
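
To make the "Pseudocode" and "Experiment Setup" rows concrete, below is a minimal, hypothetical Python sketch of one Value Function Search step in the spirit of Algorithm 1, wired to the Table 2 values (population size 10, σ drawn from [0.000005, 0.0005], batch b = 2048): the current critic's weights are perturbed with Gaussian noise, each candidate is scored on a sampled batch, and the best candidate is returned. The scoring rule (mean-squared one-step TD error on a state-value critic), the replay-buffer interface, and all names are assumptions for illustration only, since the paper's code is not released (see the "Open Source Code" row).

```python
import copy

import numpy as np
import torch

# Hypothetical VFS-style search: perturb a critic's parameters with Gaussian
# noise, score each candidate on a sampled batch, and keep the best one.
# Hyperparameter values follow Table 2; the scoring rule and all interfaces
# below are assumptions, not the authors' code.

POPULATION_SIZE = 10               # "population size" in Table 2
SIGMA_MIN, SIGMA_MAX = 5e-6, 5e-4  # perturbation std range (σmin, σmax)
BATCH_SIZE = 2048                  # "b" in Table 2
GAMMA = 0.99


def td_error_score(critic, batch):
    """Lower is better: mean-squared one-step TD error of a state-value critic."""
    s, _a, r, s_next, done = batch  # assumed tensors of matching batch size
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done) * critic(s_next).squeeze(-1)
        return torch.mean((critic(s).squeeze(-1) - target) ** 2).item()


def value_function_search(critic, replay_buffer):
    """Return the best critic among the current one and Gaussian perturbations."""
    batch = replay_buffer.sample(BATCH_SIZE)  # assumed buffer API
    best_critic, best_score = critic, td_error_score(critic, batch)
    for _ in range(POPULATION_SIZE):
        sigma = np.random.uniform(SIGMA_MIN, SIGMA_MAX)
        candidate = copy.deepcopy(critic)
        with torch.no_grad():
            for p in candidate.parameters():
                p.add_(sigma * torch.randn_like(p))  # parameter-space noise
        score = td_error_score(candidate, batch)
        if score < best_score:
            best_critic, best_score = candidate, score
    return best_critic
```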
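
A second hedged sketch shows where such a search could sit in an otherwise standard actor-critic training loop, triggered every es = 2048 environment steps as listed in Table 2. The `agent` methods and the Gymnasium-style environment interface are placeholders, not the authors' implementation.

```python
# Hypothetical integration: run the search every `es` environment steps inside
# an otherwise unchanged training loop. `agent`, `env`, and `replay_buffer`
# are placeholder interfaces; `value_function_search` is the sketch above.

SEARCH_EVERY = 2048  # "es" in Table 2


def train(agent, env, replay_buffer, total_steps=1_000_000):
    obs, _ = env.reset()  # Gymnasium-style reset returns (obs, info)
    for step in range(1, total_steps + 1):
        action = agent.act(obs)                              # assumed agent API
        next_obs, reward, terminated, truncated, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, terminated)
        agent.update(replay_buffer)                          # usual Deep PG update
        if step % SEARCH_EVERY == 0:
            # Periodically replace the critic with the best perturbed candidate.
            agent.critic = value_function_search(agent.critic, replay_buffer)
        obs = env.reset()[0] if (terminated or truncated) else next_obs
```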