Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments

Authors: Ryan Sullivan, Jordan K Terry, Benjamin Black, John P Dickerson

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent cliffs (sudden large drops in expected return). We demonstrate that A2C often dives off these cliffs into low reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods.
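The reward-surface idea behind these claims is straightforward to prototype. Below is a minimal toy sketch, assuming a hand-rolled linear policy on CartPole and a hypothetical evaluate_return helper; the paper's library instead perturbs trained agents' network weights along random directions and along the policy gradient direction, so nothing here should be read as its actual implementation.

```python
# Toy sketch (not the paper's code): perturb policy parameters along two random
# directions and measure mean episodic return at each grid point. A "cliff" shows
# up as a sudden large drop between neighbouring grid points.
import numpy as np
import gymnasium as gym

def evaluate_return(env, params, episodes=5):
    """Average episodic return of a linear policy: action = argmax(params @ obs)."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = int(np.argmax(params @ obs))
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / episodes

env = gym.make("CartPole-v1")
theta = np.zeros((env.action_space.n, env.observation_space.shape[0]))
d1, d2 = np.random.randn(*theta.shape), np.random.randn(*theta.shape)

# Sample returns over a small grid in the plane spanned by d1 and d2.
scales = np.linspace(-1.0, 1.0, 5)
surface = [[evaluate_return(env, theta + a * d1 + b * d2) for b in scales]
           for a in scales]
print(np.round(surface, 1))
```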
Researcher Affiliation | Collaboration | 1 Swarm Labs; 2 Department of Computer Science, University of Maryland, College Park.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We developed an extensive software library for plotting the reward surfaces of reinforcement learning agents to produce this work and encourage future research using these visualizations. ... The library is well organized and documented, and it can be found at https://github.com/RyanNavillus/reward-surfaces.
Open Datasets | Yes | We generated plots for all Classic Control and MuJoCo environments in Gym (Brockman et al., 2016) and for many popular Atari environments from the Arcade Learning Environment (Bellemare et al., 2013).
Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits as commonly defined for static datasets in supervised learning. Instead, it describes training based on environment steps and uses evaluation episodes for checkpoint selection and performance assessment.
Hardware Specification | No | The paper mentions that scripts are included for ... but it does not specify the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using ... but it does not provide a complete list of software dependencies with version numbers.
Experiment Setup | Yes | For these experiments we chose to plot the reward surface of PPO agents using the tuned hyperparameters found in RL Zoo3 (Raffin, 2020). ... As such, we try increasing both the learning rate (LR) and the number of steps per parallel environment per training update (N steps). ... Table 1. Table of A2C and PPO's average percent change in reward after taking a few gradient steps on cliff and non-cliff checkpoints for various sets of hyperparameters. These results are averaged among 10 trials each evaluated for 1000 episodes.
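The cliff test summarized in Table 1 could be approximated with off-the-shelf tooling roughly as follows. This is a hedged sketch, not the paper's code: the CartPole environment, the hyperparameter values, the episode count, and training fresh agents instead of loading saved cliff and non-cliff checkpoints are all illustrative assumptions.

```python
# Hedged sketch of the Table 1 protocol: measure the percent change in mean
# reward after a few gradient updates, comparing A2C and PPO.
import gymnasium as gym
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")

for algo, kwargs in [(A2C, {"learning_rate": 7e-4, "n_steps": 5}),
                     (PPO, {"learning_rate": 3e-4, "n_steps": 2048})]:
    # The paper would instead load a saved checkpoint known to sit near a cliff.
    model = algo("MlpPolicy", env, verbose=0, **kwargs)
    before, _ = evaluate_policy(model, env, n_eval_episodes=100)
    model.learn(total_timesteps=4 * kwargs["n_steps"])  # "a few" gradient steps
    after, _ = evaluate_policy(model, env, n_eval_episodes=100)
    print(f"{algo.__name__}: {100 * (after - before) / abs(before):+.1f}% change in reward")
```

Raising the learning rate or n_steps, as the quoted setup describes, makes each update larger and therefore more likely to step off a cliff, which is what the percent-change comparison is designed to expose.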