A Closer Look at Deep Policy Gradients
Authors: Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that the behavior of deep policy gradient algorithms often deviates from what their motivating framework would predict: gradient estimates poorly correlate with the true gradient, better gradient estimates can require lower learning rates and can induce degenerate agent behavior, learned value estimators fail to fit the true value function and reduce gradient-estimation variance to a significantly smaller extent than the true value function would, and the surrogate objective does not match the true reward landscape, so the underlying optimization landscape can be misleading. (A sketch of the gradient-correlation measurement appears after this table.) |
| Researcher Affiliation | Collaboration | Andrew Ilyas (1), Logan Engstrom (1), Shibani Santurkar (1), Dimitris Tsipras (1), Firdaus Janoos (2), Larry Rudolph (1, 2), and Aleksander Madry (1). Affiliations: 1 = MIT, 2 = Two Sigma. |
| Pseudocode | No | The paper describes the algorithms and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | Empirical variance of the estimated gradient (cf. Eq. (1)) as a function of the number of state-action pairs used in estimation in the MuJoCo Humanoid task. Mean reward for the studied policy gradient algorithms on standard MuJoCo benchmark tasks. |
| Dataset Splits | Yes | Quality of value prediction in terms of mean relative error (MRE) on held-out state-action pairs for agents trained to solve MuJoCo tasks. We see that the agents do indeed succeed at solving the supervised learning task they are trained for: the validation MRE on the GAE-based value loss $(V_{old} + A_{GAE})^2$ (cf. Eq. (4)) is small (left column). (A sketch of the MRE computation appears after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers, such as the programming language or libraries used. |
| Experiment Setup | Yes | We use the following parameters for PPO and TRPO, based on a hyperparameter grid search (Table 1: Hyperparameters for PPO and TRPO algorithms). Shared: timesteps per iteration 2048; discount factor (γ) 0.99; GAE discount (λ) 0.95; value network LR 0.0001; value net num. epochs 10; policy net hidden layers [64, 64]; value net hidden layers [64, 64]; entropy coeff. 0.0; reward clipping [-10, 10]; state clipping [-10, 10]. PPO only: policy LR (Adam) 0.00025; policy epochs 10; PPO clipping ε 0.2; reward normalization on. TRPO only: KL constraint (δ) 0.07; Fisher est. fraction 0.1; conjugate grad. steps 10; CG damping 0.1; backtracking steps 10; reward normalization off. (These hyperparameters are transcribed as a Python config sketch after this table.) |
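
The "Research Type" row above summarizes the paper's central measurement: how well sampled policy-gradient estimates agree with the "true" gradient. The sketch below illustrates that kind of measurement, assuming a hypothetical `collect_grad_estimate(num_samples)` helper that returns a flattened policy-gradient estimate computed from the given number of state-action pairs; it approximates the true gradient with a very large-sample estimate and is not the authors' released code.

```python
import numpy as np


def cosine_similarity(g1: np.ndarray, g2: np.ndarray) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12))


def gradient_quality(collect_grad_estimate, sample_sizes, n_trials=10,
                     reference_samples=1_000_000):
    """For each sample budget, report mean/std cosine similarity between
    independent gradient estimates and a large-sample reference gradient."""
    g_ref = collect_grad_estimate(reference_samples)  # proxy for the "true" gradient
    results = {}
    for k in sample_sizes:
        sims = [cosine_similarity(collect_grad_estimate(k), g_ref)
                for _ in range(n_trials)]
        results[k] = (float(np.mean(sims)), float(np.std(sims)))
    return results
```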
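
The "Dataset Splits" row reports mean relative error (MRE) of the learned value network on held-out state-action pairs. A minimal sketch of that metric, assuming `predicted` and `target` are NumPy arrays of value predictions and value targets for the held-out split (the names and the 90/10 split in the usage comment are illustrative assumptions):

```python
import numpy as np


def mean_relative_error(predicted: np.ndarray, target: np.ndarray) -> float:
    """MRE: mean over held-out states of |V_pred - V_target| / |V_target|."""
    return float(np.mean(np.abs(predicted - target) / (np.abs(target) + 1e-8)))


# Illustrative usage with a simple 90/10 train/validation split of collected states:
# split = int(0.9 * len(states))
# val_mre = mean_relative_error(value_net(states[split:]), value_targets[split:])
```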
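
For convenience, the Table 1 hyperparameters quoted in the "Experiment Setup" row are transcribed below as plain Python dictionaries. The key names are illustrative and do not correspond to any particular library's argument names.

```python
# Hyperparameters from Table 1 of the paper; values marked N/A there are omitted.
PPO_HPARAMS = {
    "timesteps_per_iteration": 2048,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "value_lr": 1e-4,
    "value_epochs": 10,
    "policy_hidden_layers": [64, 64],
    "value_hidden_layers": [64, 64],
    "policy_lr_adam": 2.5e-4,
    "policy_epochs": 10,
    "clip_epsilon": 0.2,
    "entropy_coeff": 0.0,
    "reward_clipping": (-10, 10),
    "reward_normalization": True,
    "state_clipping": (-10, 10),
}

TRPO_HPARAMS = {
    "timesteps_per_iteration": 2048,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "value_lr": 1e-4,
    "value_epochs": 10,
    "policy_hidden_layers": [64, 64],
    "value_hidden_layers": [64, 64],
    "kl_constraint_delta": 0.07,
    "fisher_estimation_fraction": 0.1,
    "conjugate_gradient_steps": 10,
    "cg_damping": 0.1,
    "backtracking_steps": 10,
    "entropy_coeff": 0.0,
    "reward_clipping": (-10, 10),
    "reward_normalization": False,
    "state_clipping": (-10, 10),
}
```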