Deep Reinforcement Learning with Robust and Smooth Policy

Authors: Qianli Shen, Yan Li, Haoming Jiang, Zhaoran Wang, Tuo Zhao

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness.
Researcher Affiliation | Academia | 1 Peking University, Beijing, China. 2 Georgia Institute of Technology, Atlanta, USA. 3 Northwestern University, Evanston, USA.
Pseudocode | Yes | Algorithm 1 Trust Region Policy Optimization with Smoothness-inducing Regularization. ... Algorithm 2 DDPG with smoothness-inducing regularization on the actor network (DDPG-SR-A). ... Algorithm 3 DDPG with smoothness-inducing regularization on the critic network (DDPG-SR-C).
Open Source Code | No | Our implementation of SR2L training framework is based on the open source toolkit garage (garage contributors, 2019).
Open Datasets | Yes | We test our algorithms on OpenAI Gym (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator.
Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running experiments.
Software Dependencies | No | The paper mentions using the "garage" toolkit but does not specify its version number or any other software dependencies with their respective versions.
Experiment Setup | Yes | For all tasks, we use a network of 2 hidden layers, each containing 64 neurons, to parameterize the policy and the Q-function. ... We use the grid search to select the hyper-parameters (perturbation strength ϵ, regularization coefficient λ_s) of the smoothness-inducing regularizer. We set the search range to be ϵ ∈ [10^-5, 10^-1], λ_s ∈ [10^-2, 10^2]. To solve the inner maximization problem, we run 10 steps of projected gradient ascent, with step size set as 0.2ϵ. For each algorithm and each environment, we train 10 policies with different initialization for 500 iterations (1K environment steps for each iteration).
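
The Experiment Setup row quotes the inner maximization of the smoothness-inducing regularizer being solved with 10 steps of projected gradient ascent at step size 0.2ϵ, and the Pseudocode row lists a DDPG variant with the regularizer on the actor network (DDPG-SR-A). The sketch below illustrates how such a regularized actor update could look. It is a minimal PyTorch sketch under stated assumptions, not the authors' garage-based implementation: the function names (`smoothness_perturbation`, `actor_loss_with_sr`), the ℓ∞ perturbation ball with sign-gradient steps, the random initialization of the perturbation, the squared-ℓ2 distance between actor outputs, and the `critic(states, actions)` signature are all illustrative choices.

```python
import torch

def smoothness_perturbation(actor, states, eps, n_steps=10, step_frac=0.2):
    """Approximate the inner maximization of the smoothness regularizer with
    projected gradient ascent (10 steps, step size 0.2*eps, as quoted in the
    Experiment Setup row). The l-inf ball, sign-gradient steps, and squared-L2
    distance are assumptions for illustration."""
    # Random start inside the eps-ball; delta = 0 is a stationary point of the
    # squared distance, so a zero initialization would never move.
    delta = torch.empty_like(states).uniform_(-eps, eps).requires_grad_(True)
    with torch.no_grad():
        ref = actor(states)  # actor output at the unperturbed states
    for _ in range(n_steps):
        # Distance between actor outputs at s and at the perturbed state s~.
        dist = ((actor(states + delta) - ref) ** 2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(dist, delta)
        with torch.no_grad():
            delta += step_frac * eps * grad.sign()  # ascent step
            delta.clamp_(-eps, eps)                 # project back onto the eps-ball
    return delta.detach()

def actor_loss_with_sr(actor, critic, states, eps, lambda_s):
    """DDPG-style actor loss plus the smoothness penalty (hypothetical helper);
    lambda_s plays the role of the regularization coefficient from the paper."""
    delta = smoothness_perturbation(actor, states, eps)
    policy_loss = -critic(states, actor(states)).mean()
    smooth_penalty = ((actor(states + delta) - actor(states)) ** 2).sum(dim=-1).mean()
    return policy_loss + lambda_s * smooth_penalty
```

Under the quoted protocol, ϵ and λ_s would then be selected by grid search over [10^-5, 10^-1] and [10^-2, 10^2], respectively.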