Deep Reinforcement Learning with Robust and Smooth Policy

Authors: Qianli Shen, Yan Li, Haoming Jiang, Zhaoran Wang, Tuo Zhao

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness.
Researcher Affiliation | Academia | 1 Peking University, Beijing, China. 2 Georgia Institute of Technology, Atlanta, USA. 3 Northwestern University, Evanston, USA.
Pseudocode | Yes | Algorithm 1 Trust Region Policy Optimization with Smoothness-inducing Regularization. ... Algorithm 2 DDPG with smoothness-inducing regularization on the actor network (DDPG-SR-A). ... Algorithm 3 DDPG with smoothness-inducing regularization on the critic network (DDPG-SR-C).
Open Source Code | No | Our implementation of SR2L training framework is based on the open source toolkit garage (garage contributors, 2019).
Open Datasets | Yes | We test our algorithms on OpenAI Gym (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator.
Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running experiments.
Software Dependencies | No | The paper mentions using the "garage" toolkit but does not specify its version number or any other software dependencies with their respective versions.
Experiment Setup | Yes | For all tasks, we use a network of 2 hidden layers, each containing 64 neurons, to parameterize the policy and the Q-function. ... We use the grid search to select the hyper-parameters (perturbation strength ϵ, regularization coefficient λ_s) of the smoothness-inducing regularizer. We set the search range to be ϵ ∈ [10^-5, 10^-1], λ_s ∈ [10^-2, 10^2]. To solve the inner maximization problem, we run 10 steps of projected gradient ascent, with step size set as 0.2ϵ. For each algorithm and each environment, we train 10 policies with different initialization for 500 iterations (1K environment steps for each iteration).
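
The Experiment Setup row quotes the inner maximization of the smoothness-inducing regularizer being solved with 10 steps of projected gradient ascent at step size 0.2ϵ, and the Pseudocode row lists a DDPG variant with the regularizer on the actor network (DDPG-SR-A). The sketch below illustrates how such a regularized actor update could look. It is a minimal PyTorch sketch under stated assumptions, not the authors' garage-based implementation: the function names (`smoothness_perturbation`, `actor_loss_with_sr`), the ℓ∞ perturbation ball with sign-gradient steps, the random initialization of the perturbation, the squared-ℓ2 distance between actor outputs, and the `critic(states, actions)` signature are all illustrative choices.

```python
import torch

def smoothness_perturbation(actor, states, eps, n_steps=10, step_frac=0.2):
    """Approximate the inner maximization of the smoothness regularizer with
    projected gradient ascent (10 steps, step size 0.2*eps, as quoted in the
    Experiment Setup row). The l-inf ball, sign-gradient steps, and squared-L2
    distance are assumptions for illustration."""
    # Random start inside the eps-ball; delta = 0 is a stationary point of the
    # squared distance, so a zero initialization would never move.
    delta = torch.empty_like(states).uniform_(-eps, eps).requires_grad_(True)
    with torch.no_grad():
        ref = actor(states)  # actor output at the unperturbed states
    for _ in range(n_steps):
        # Distance between actor outputs at s and at the perturbed state s~.
        dist = ((actor(states + delta) - ref) ** 2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(dist, delta)
        with torch.no_grad():
            delta += step_frac * eps * grad.sign()  # ascent step
            delta.clamp_(-eps, eps)                 # project back onto the eps-ball
    return delta.detach()

def actor_loss_with_sr(actor, critic, states, eps, lambda_s):
    """DDPG-style actor loss plus the smoothness penalty (hypothetical helper);
    lambda_s plays the role of the regularization coefficient from the paper."""
    delta = smoothness_perturbation(actor, states, eps)
    policy_loss = -critic(states, actor(states)).mean()
    smooth_penalty = ((actor(states + delta) - actor(states)) ** 2).sum(dim=-1).mean()
    return policy_loss + lambda_s * smooth_penalty
```

Under the quoted protocol, ϵ and λ_s would then be selected by grid search over [10^-5, 10^-1] and [10^-2, 10^2], respectively.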