On Proximal Policy Optimization’s Heavy-tailed Gradients

Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness.
Researcher Affiliation | Academia | (1) Machine Learning Department, Carnegie Mellon University; (2) Computer Science Department, Carnegie Mellon University; (3) Department of Statistics and Data Science, Carnegie Mellon University. Correspondence to: Saurabh Garg <sgarg2@andrew.cmu.edu>.
Pseudocode | Yes | Algorithm 1 BLOCK-GMOM
Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found.
Open Datasets | No | The paper mentions 'MuJoCo continuous control tasks' and references 'Todorov et al., 2012' for MuJoCo, which is a physics engine and simulation environment. While commonly used for RL experiments, this does not provide a direct link, DOI, or specific citation for a publicly available dataset, as data is typically generated within the environment rather than drawn from a static dataset.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with citations for the MuJoCo tasks.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments are provided.
Software Dependencies | No | The paper does not list specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, TensorFlow, specific libraries).
Experiment Setup | Yes | We compare the performances of PPO, PPO-NOCLIP, and ROBUST-PPO-NOCLIP, using hyperparameters that are tuned individually for each method but held fixed across all tasks (Table 1).
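For context on the Research Type row above: the quantity whose gradients the paper characterizes is the PPO surrogate objective, which is built from importance (likelihood) ratios and advantage estimates. The snippet below is a minimal PyTorch sketch of the standard clipped surrogate loss, not code from the paper; the tensor names and the clip value of 0.2 are illustrative.

```python
import torch

def ppo_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated so it can be minimized.

    The likelihood ratio r_t = pi_new(a|s) / pi_old(a|s) multiplies the advantage
    estimate A_t; these two factors are the terms the paper implicates as the
    main sources of heavy-tailed gradients.
    """
    ratios = torch.exp(log_probs_new - log_probs_old)            # importance ratios r_t
    unclipped = ratios * advantages                              # r_t * A_t
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum of the two terms; return the negative as a loss.
    return -torch.min(unclipped, clipped).mean()
```

The PPO-NOCLIP variant named in the Experiment Setup row presumably drops the clipping term, leaving the plain ratio-times-advantage objective.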
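The Pseudocode row points to Algorithm 1 (BLOCK-GMOM) without reproducing it. Below is a hedged sketch of a geometric median-of-means gradient estimator of the kind such an algorithm is built on: per-sample gradients are averaged within buckets, and the bucket means are combined with a geometric median computed by Weiszfeld iterations. The bucketing scheme, the handling of parameter blocks, and the iteration counts here are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def geometric_median(points, iters=50, eps=1e-8):
    """Approximate the geometric median of a set of vectors via Weiszfeld's algorithm.

    points: tensor of shape (num_buckets, num_params).
    """
    median = points.mean(dim=0)
    for _ in range(iters):
        dists = torch.norm(points - median, dim=1).clamp_min(eps)
        weights = 1.0 / dists
        median = (weights[:, None] * points).sum(dim=0) / weights.sum()
    return median

def gmom_gradient(per_sample_grads, num_buckets=10):
    """Robust gradient estimate: bucket the per-sample gradients, average within
    each bucket, then take the geometric median of the bucket means.

    per_sample_grads: tensor of shape (batch_size, num_params). A block-wise
    variant would apply this per parameter block (e.g., per layer) instead of
    to the full flattened gradient.
    """
    buckets = torch.chunk(per_sample_grads, num_buckets, dim=0)
    bucket_means = torch.stack([b.mean(dim=0) for b in buckets])
    return geometric_median(bucket_means)
```

In a ROBUST-PPO-NOCLIP-style update, an estimate of this kind would stand in for the ordinary mini-batch mean gradient before the optimizer step; the exact integration in the paper may differ.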