On Proximal Policy Optimization’s Heavy-tailed Gradients
Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off-policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. (A tail-index diagnostic sketch appears after this table.) |
| Researcher Affiliation | Academia | 1Machine Learning Department, Carnegie Mellon University 2Computer Science Department, Carnegie Mellon University 3Department of Statistics and Data Science, Carnegie Mellon University. Correspondence to: Saurabh Garg <sgarg2@andrew.cmu.edu>. |
| Pseudocode | Yes | Algorithm 1 BLOCK-GMOM (a median-of-means sketch follows the table) |
| Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found. |
| Open Datasets | No | The paper mentions MuJoCo continuous control tasks and cites Todorov et al. (2012) for MuJoCo, which is a physics engine and simulation environment. While commonly used for RL experiments, this does not constitute a direct link, DOI, or specific citation for a publicly available dataset, as data is typically generated within the environment rather than drawn from a static dataset. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with citations for the MuJoCo tasks. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments are provided. |
| Software Dependencies | No | The paper does not list specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, TensorFlow, specific libraries). |
| Experiment Setup | Yes | We compare the performances of PPO, PPO-NOCLIP, and ROBUST-PPO-NOCLIP, using hyperparameters that are tuned individually for each method but held fixed across all tasks (Table 1). (A sketch of the clipped vs. unclipped surrogate follows the table.) |
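
The heavy-tailedness characterization quoted in the Research Type row rests on estimating a tail index from gradient samples. As a rough illustration only, the sketch below applies a generic Hill estimator to gradient-norm samples; the function name, the synthetic Pareto data, and the choice of estimator are assumptions here, not the paper's exact procedure.

```python
import numpy as np

def hill_tail_index(samples: np.ndarray, k: int) -> float:
    """Hill estimator of the tail index from the k largest magnitudes.

    A smaller estimate indicates heavier tails; values below 2 suggest
    infinite variance. Generic diagnostic, not necessarily the paper's
    estimator.
    """
    x = np.sort(np.abs(samples))[::-1]        # magnitudes, descending
    top = x[: k + 1]                          # k largest plus the threshold
    return k / np.sum(np.log(top[:k] / top[k]))

# Synthetic check: Pareto-distributed "gradient norms" with tail index 1.5
rng = np.random.default_rng(0)
grad_norms = rng.pareto(1.5, size=10_000) + 1.0
print(hill_tail_index(grad_norms, k=500))     # prints roughly 1.5
```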
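For the Pseudocode row, the following hedged sketch shows the median-of-means idea that the paper's Algorithm 1 (BLOCK-GMOM) builds on: split per-sample gradients into blocks, average within each block, and aggregate the block means with a geometric median, computed here via Weiszfeld's iteration. The helper names, the block assignment, and the convergence tolerances are assumptions; the paper's algorithm adds details (e.g., how gradients are partitioned during training) that are not reproduced here.

```python
import numpy as np

def geometric_median(points: np.ndarray, iters: int = 100,
                     eps: float = 1e-8) -> np.ndarray:
    """Weiszfeld iteration for the geometric median of row vectors."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(d, eps)          # guard against zero distance
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

def block_gmom(per_sample_grads: np.ndarray, num_blocks: int) -> np.ndarray:
    """Median-of-means estimate: block means, then their geometric median."""
    blocks = np.array_split(per_sample_grads, num_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(block_means)

# Usage: 1024 per-sample gradients in R^10, a few gross outliers
rng = np.random.default_rng(1)
grads = rng.standard_normal((1024, 10))
grads[:8] += 1e3                              # corruptions that wreck the plain mean
robust_grad = block_gmom(grads, num_blocks=32)
```

The design point is that the geometric median of block means stays close to the true mean gradient even when a few blocks are contaminated by heavy-tailed samples, which the plain mean is not robust to.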
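Finally, the Experiment Setup row compares PPO against unclipped variants. The sketch below writes out the per-sample surrogate objectives being compared, assuming a PyTorch setting; the function name and the `clip` flag are illustrative, and the robust variant (ROBUST-PPO-NOCLIP) would additionally replace the mean gradient with a GMOM-style estimate, which is not shown.

```python
import torch

def ppo_surrogate(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2,
                  clip: bool = True) -> torch.Tensor:
    """Per-sample PPO surrogate reward (maximized during training).

    clip=True is the standard clipped objective; clip=False corresponds
    to an unclipped variant such as PPO-NOCLIP in the paper.
    """
    ratio = torch.exp(log_probs - old_log_probs)   # likelihood ratio r_t
    if not clip:
        return ratio * advantages                  # unclipped surrogate
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(ratio * advantages, clipped * advantages)
```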