On Proximal Policy Optimization’s Heavy-tailed Gradients

Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness.
Researcher Affiliation | Academia | (1) Machine Learning Department, Carnegie Mellon University; (2) Computer Science Department, Carnegie Mellon University; (3) Department of Statistics and Data Science, Carnegie Mellon University. Correspondence to: Saurabh Garg <sgarg2@andrew.cmu.edu>.
Pseudocode | Yes | Algorithm 1 BLOCK-GMOM
Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found.
Open Datasets | No | The paper mentions 'MuJoCo continuous control tasks' and references 'Todorov et al., 2012' for MuJoCo, which is a physics engine and simulation environment. While commonly used for RL experiments, this does not provide a direct link, DOI, or specific citation for a publicly available dataset, as data is typically generated within the environment rather than drawn from a static dataset.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with citations for the MuJoCo tasks.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments are provided.
Software Dependencies | No | The paper does not list specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, TensorFlow, specific libraries).
Experiment Setup | Yes | We compare the performances of PPO, PPO-NOCLIP, and ROBUST-PPO-NOCLIP, using hyperparameters that are tuned individually for each method but held fixed across all tasks (Table 1).
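For context on the Research Type row above: the quantity whose gradients the paper characterizes is the PPO surrogate objective, which is built from importance (likelihood) ratios and advantage estimates. The snippet below is a minimal PyTorch sketch of the standard clipped surrogate loss, not code from the paper; the tensor names and the clip value of 0.2 are illustrative.

```python
import torch

def ppo_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated so it can be minimized.

    The likelihood ratio r_t = pi_new(a|s) / pi_old(a|s) multiplies the advantage
    estimate A_t; these two factors are the terms the paper implicates as the
    main sources of heavy-tailed gradients.
    """
    ratios = torch.exp(log_probs_new - log_probs_old)            # importance ratios r_t
    unclipped = ratios * advantages                              # r_t * A_t
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum of the two terms; return the negative as a loss.
    return -torch.min(unclipped, clipped).mean()
```

The PPO-NOCLIP variant named in the Experiment Setup row presumably drops the clipping term, leaving the plain ratio-times-advantage objective.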
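The Pseudocode row points to Algorithm 1 (BLOCK-GMOM) without reproducing it. Below is a hedged sketch of a geometric median-of-means gradient estimator of the kind such an algorithm is built on: per-sample gradients are averaged within buckets, and the bucket means are combined with a geometric median computed by Weiszfeld iterations. The bucketing scheme, the handling of parameter blocks, and the iteration counts here are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def geometric_median(points, iters=50, eps=1e-8):
    """Approximate the geometric median of a set of vectors via Weiszfeld's algorithm.

    points: tensor of shape (num_buckets, num_params).
    """
    median = points.mean(dim=0)
    for _ in range(iters):
        dists = torch.norm(points - median, dim=1).clamp_min(eps)
        weights = 1.0 / dists
        median = (weights[:, None] * points).sum(dim=0) / weights.sum()
    return median

def gmom_gradient(per_sample_grads, num_buckets=10):
    """Robust gradient estimate: bucket the per-sample gradients, average within
    each bucket, then take the geometric median of the bucket means.

    per_sample_grads: tensor of shape (batch_size, num_params). A block-wise
    variant would apply this per parameter block (e.g., per layer) instead of
    to the full flattened gradient.
    """
    buckets = torch.chunk(per_sample_grads, num_buckets, dim=0)
    bucket_means = torch.stack([b.mean(dim=0) for b in buckets])
    return geometric_median(bucket_means)
```

In a ROBUST-PPO-NOCLIP-style update, an estimate of this kind would stand in for the ordinary mini-batch mean gradient before the optimizer step; the exact integration in the paper may differ.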