A Closer Look at Deep Policy Gradients
Authors: Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict: gradient estimates poorly correlate with the true gradient, better gradient estimates can require lower learning rates and can induce degenerate agent behavior, learned value estimators fail to fit the true value function and reduce gradient-estimation variance to a significantly smaller extent than the true value function would, and the surrogate objective does not match the true reward landscape, so the underlying optimization landscape can be misleading. (A sketch of the gradient-correlation measurement appears after this table.) |
| Researcher Affiliation | Collaboration | Andrew Ilyas (1), Logan Engstrom (1), Shibani Santurkar (1), Dimitris Tsipras (1), Firdaus Janoos (2), Larry Rudolph (1, 2), and Aleksander Madry (1). Affiliations: 1 = MIT, 2 = Two Sigma. |
| Pseudocode | No | The paper describes the algorithms and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | Empirical variance of the estimated gradient (cf. Eq. (1)) as a function of the number of state-action pairs used in estimation in the MuJoCo Humanoid task. Mean reward for the studied policy gradient algorithms on standard MuJoCo benchmark tasks. |
| Dataset Splits | Yes | Quality of value prediction in terms of mean relative error (MRE) on held-out state-action pairs for agents trained to solve MuJoCo tasks. We see that the agents do indeed succeed at solving the supervised learning task they are trained for: the validation MRE on the GAE-based value loss $(V_{old} + A_{GAE})^2$ (cf. Eq. (4)) is small (left column). (A sketch of the MRE computation appears after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as the programming language or libraries used. |
| Experiment Setup | Yes | We use the following parameters for PPO and TRPO, based on a hyperparameter grid search (Table 1: Hyperparameters for PPO and TRPO algorithms). Shared: timesteps per iteration 2048; discount factor (γ) 0.99; GAE discount (λ) 0.95; value network LR 0.0001; value net num. epochs 10; policy net hidden layers [64, 64]; value net hidden layers [64, 64]; entropy coeff. 0.0; reward clipping [-10, 10]; state clipping [-10, 10]. PPO only: policy LR (Adam) 0.00025; policy epochs 10; PPO clipping ε 0.2; reward normalization on. TRPO only: KL constraint (δ) 0.07; Fisher est. fraction 0.1; conjugate grad. steps 10; CG damping 0.1; backtracking steps 10; reward normalization off. (These hyperparameters are transcribed as a Python config sketch after this table.) |
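
The "Research Type" row above summarizes the paper's central measurement: how well sampled policy-gradient estimates agree with the "true" gradient. The sketch below illustrates that kind of measurement, assuming a hypothetical `collect_grad_estimate(num_samples)` helper that returns a flattened policy-gradient estimate computed from the given number of state-action pairs; it approximates the true gradient with a very large-sample estimate and is not the authors' released code.

```python
import numpy as np


def cosine_similarity(g1: np.ndarray, g2: np.ndarray) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12))


def gradient_quality(collect_grad_estimate, sample_sizes, n_trials=10,
                     reference_samples=1_000_000):
    """For each sample budget, report mean/std cosine similarity between
    independent gradient estimates and a large-sample reference gradient."""
    g_ref = collect_grad_estimate(reference_samples)  # proxy for the "true" gradient
    results = {}
    for k in sample_sizes:
        sims = [cosine_similarity(collect_grad_estimate(k), g_ref)
                for _ in range(n_trials)]
        results[k] = (float(np.mean(sims)), float(np.std(sims)))
    return results
```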
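
The "Dataset Splits" row reports mean relative error (MRE) of the learned value network on held-out state-action pairs. A minimal sketch of that metric, assuming `predicted` and `target` are NumPy arrays of value predictions and value targets for the held-out split (the names and the 90/10 split in the usage comment are illustrative assumptions):

```python
import numpy as np


def mean_relative_error(predicted: np.ndarray, target: np.ndarray) -> float:
    """MRE: mean over held-out states of |V_pred - V_target| / |V_target|."""
    return float(np.mean(np.abs(predicted - target) / (np.abs(target) + 1e-8)))


# Illustrative usage with a simple 90/10 train/validation split of collected states:
# split = int(0.9 * len(states))
# val_mre = mean_relative_error(value_net(states[split:]), value_targets[split:])
```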
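
For convenience, the Table 1 hyperparameters quoted in the "Experiment Setup" row are transcribed below as plain Python dictionaries. The key names are illustrative and do not correspond to any particular library's argument names.

```python
# Hyperparameters from Table 1 of the paper; values marked N/A there are omitted.
PPO_HPARAMS = {
    "timesteps_per_iteration": 2048,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "value_lr": 1e-4,
    "value_epochs": 10,
    "policy_hidden_layers": [64, 64],
    "value_hidden_layers": [64, 64],
    "policy_lr_adam": 2.5e-4,
    "policy_epochs": 10,
    "clip_epsilon": 0.2,
    "entropy_coeff": 0.0,
    "reward_clipping": (-10, 10),
    "reward_normalization": True,
    "state_clipping": (-10, 10),
}

TRPO_HPARAMS = {
    "timesteps_per_iteration": 2048,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "value_lr": 1e-4,
    "value_epochs": 10,
    "policy_hidden_layers": [64, 64],
    "value_hidden_layers": [64, 64],
    "kl_constraint_delta": 0.07,
    "fisher_estimation_fraction": 0.1,
    "conjugate_gradient_steps": 10,
    "cg_damping": 0.1,
    "backtracking_steps": 10,
    "entropy_coeff": 0.0,
    "reward_clipping": (-10, 10),
    "reward_normalization": False,
    "state_clipping": (-10, 10),
}
```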