Deep Reinforcement Learning with Robust and Smooth Policy
Authors: Qianli Shen, Yan Li, Haoming Jiang, Zhaoran Wang, Tuo Zhao
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness. |
| Researcher Affiliation | Academia | 1 Peking University, Beijing, China. 2 Georgia Institute of Technology, Atlanta, USA. 3 Northwestern University, Evanston, USA. |
| Pseudocode | Yes | Algorithm 1 Trust Region Policy Optimization with Smoothness-inducing Regularization. ... Algorithm 2 DDPG with smoothness-inducing regularization on the actor network (DDPG-SR-A). ... Algorithm 3 DDPG with smoothness-inducing regularization on the critic network (DDPG-SR-C). |
| Open Source Code | No | Our implementation of SR2L training framework is based on the open source toolkit garage (garage contributors, 2019). |
| Open Datasets | Yes | We test our algorithms on OpenAI Gym (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator. |
| Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running experiments. |
| Software Dependencies | No | The paper mentions using the "garage" toolkit but does not specify its version number or any other software dependencies with their respective versions. |
| Experiment Setup | Yes | For all tasks, we use a network of 2 hidden layers, each containing 64 neurons, to parameterize the policy and the Q-function. ... We use the grid search to select the hyper-parameters (perturbation strength ϵ, regularization coefficient λs) of the smoothness-inducing regularizer. We set the search range to be ϵ ∈ [10⁻⁵, 10⁻¹], λs ∈ [10⁻², 10²]. To solve the inner maximization problem, we run 10 steps of projected gradient ascent, with step size set as 0.2ϵ. For each algorithm and each environment, we train 10 policies with different initialization for 500 iterations (1K environment steps for each iteration). |
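
A minimal PyTorch sketch of the inner maximization quoted in the Experiment Setup row (10 steps of projected gradient ascent with step size 0.2ϵ) is given below. It assumes a deterministic policy network, a squared-ℓ2 distance as the smoothness measure, and an ℓ∞ perturbation ball with a sign-gradient ascent step; the paper's SR2L regularizer may instead use a divergence between action distributions for stochastic policies, so treat this as an illustration under those assumptions rather than the authors' implementation.

```python
import torch


def smoothness_penalty(policy, states, eps, num_steps=10, step_frac=0.2):
    """Hypothetical sketch of the inner maximization for a smoothness-inducing
    regularizer: approximately solve
        max_{||delta||_inf <= eps}  || policy(s + delta) - policy(s) ||^2
    with projected gradient ascent (10 steps, step size 0.2 * eps, as in the
    quoted setup). The l_inf ball, squared-l2 distance, and sign-gradient step
    are assumptions made for this illustration."""
    with torch.no_grad():
        reference = policy(states)                 # actions at the clean states, held fixed

    # Random start inside the perturbation ball.
    delta = torch.empty_like(states).uniform_(-eps, eps)
    delta.requires_grad_(True)
    step_size = step_frac * eps

    for _ in range(num_steps):
        diff = policy(states + delta) - reference
        objective = diff.pow(2).sum(dim=-1).mean()  # distance to maximize
        grad, = torch.autograd.grad(objective, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()        # ascent step
            delta.clamp_(-eps, eps)                 # project back onto the eps-ball

    # Re-evaluate at the (approximate) worst-case perturbation; gradients now
    # flow to the policy parameters so the penalty can be backpropagated.
    diff = policy(states + delta.detach()) - reference
    return diff.pow(2).sum(dim=-1).mean()
```

In training, this penalty would be weighted by the regularization coefficient λs selected by the grid search and added to the actor loss (or the TRPO surrogate objective, per Algorithms 1 and 2 quoted above); the function and argument names here are illustrative and not taken from the authors' garage-based code.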