Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments
Authors: Ryan Sullivan, Jordan K Terry, Benjamin Black, John P Dickerson
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent cliffs (sudden large drops in expected return). We demonstrate that A2C often dives off these cliffs into low reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods. (A hedged sketch of the reward-surface construction appears after this table.) |
| Researcher Affiliation | Collaboration | ¹Swarm Labs; ²Department of Computer Science, University of Maryland, College Park. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We developed an extensive software library for plotting the reward surfaces of reinforcement learning agents to produce this work and encourage future research using these visualizations. ... The library is well organized and documented, and it can be found at https://github.com/RyanNavillus/reward-surfaces. |
| Open Datasets | Yes | We generated plots for all Classic Control and MuJoCo environments in Gym (Brockman et al., 2016) and for many popular Atari environments from the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits as commonly defined for static datasets in supervised learning. Instead, it describes training based on environment steps and uses evaluation episodes for checkpoint selection and performance assessment. |
| Hardware Specification | No | The paper mentions that scripts are included, but it does not specify the hardware (CPU/GPU models, memory, or compute budget) used to run the experiments. |
| Software Dependencies | No | The paper mentions the software it builds on, but it does not provide a complete list of dependencies with version numbers. |
| Experiment Setup | Yes | For these experiments we chose to plot the reward surface of PPO agents using the tuned hyperparameters found in RL Baselines3 Zoo (Raffin, 2020). ... As such, we try increasing both the learning rate (LR) and the number of steps per parallel environment per training update (n_steps). ... Table 1: A2C's and PPO's average percent change in reward after taking a few gradient steps on cliff and non-cliff checkpoints for various sets of hyperparameters. These results are averaged over 10 trials, each evaluated for 1000 episodes. (A hedged sketch of this cliff comparison appears after this table.) |
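
The reward-surface idea summarized in the Research Type row can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' reward-surfaces library: it perturbs a trained policy's parameters along two random directions (rescaled per tensor to roughly match parameter norms, a simplification of filter normalization) and evaluates mean episodic return on a grid. The environment, grid resolution, episode count, and training budget are all placeholder assumptions.

```python
# Minimal sketch, assuming stable-baselines3 and gymnasium; not the authors'
# reward-surfaces library. Estimates a 2-D reward surface around a trained
# policy by evaluating return at parameter offsets theta_0 + a*d1 + b*d2.
import numpy as np
import torch
import gymnasium as gym
from stable_baselines3 import PPO


def random_direction_like(params):
    """Random direction with the same shapes as the policy parameters,
    rescaled per tensor to match each parameter's norm."""
    direction = []
    for p in params:
        d = np.random.randn(*p.shape)
        d *= np.linalg.norm(p) / (np.linalg.norm(d) + 1e-10)
        direction.append(d)
    return direction


def mean_return(model, env, episodes=5):
    """Average undiscounted episodic return of the current policy."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
    return total / episodes


env = gym.make("CartPole-v1")                    # placeholder environment
model = PPO("MlpPolicy", env).learn(50_000)      # placeholder training budget

base = [p.detach().cpu().numpy().copy() for p in model.policy.parameters()]
d1, d2 = random_direction_like(base), random_direction_like(base)

grid = np.linspace(-1.0, 1.0, 11)
surface = np.zeros((len(grid), len(grid)))
for i, a in enumerate(grid):
    for j, b in enumerate(grid):
        # Move the policy to theta_0 + a*d1 + b*d2 and measure its return.
        with torch.no_grad():
            for p, p0, u, v in zip(model.policy.parameters(), base, d1, d2):
                p.copy_(torch.as_tensor(p0 + a * u + b * v, dtype=p.dtype))
        surface[i, j] = mean_return(model, env)
```

Plotting `surface` as a heatmap reproduces the kind of visualization the paper reports; cliffs show up as grid cells where the return collapses relative to their neighbors.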
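The Experiment Setup row's comparison (Table 1) can likewise be sketched. The snippet below is a hedged illustration, not the paper's exact protocol: it initializes A2C and PPO from the same saved policy weights, takes a small number of updates under a chosen learning rate and n_steps, and reports the percent change in mean evaluation return. The checkpoint path, environment, hyperparameter values, and step/episode counts are hypothetical placeholders.

```python
# Hedged sketch of a Table 1-style comparison, assuming stable-baselines3.
import torch
import gymnasium as gym
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor


def percent_change_after_updates(algo_cls, policy_weights_path, env_id,
                                 learning_rate, n_steps,
                                 update_steps=4_096, eval_episodes=100):
    env = Monitor(gym.make(env_id))
    # Fresh model with the hyperparameters under test; load actor-critic
    # weights saved earlier via torch.save(model.policy.state_dict(), path),
    # so A2C and PPO start from the same checkpoint (assumes matching net_arch).
    model = algo_cls("MlpPolicy", env,
                     learning_rate=learning_rate, n_steps=n_steps)
    model.policy.load_state_dict(torch.load(policy_weights_path))
    before, _ = evaluate_policy(model, env, n_eval_episodes=eval_episodes)
    model.learn(total_timesteps=update_steps)      # "a few gradient steps"
    after, _ = evaluate_policy(model, env, n_eval_episodes=eval_episodes)
    return 100.0 * (after - before) / (abs(before) + 1e-10)


for algo_cls, name in [(A2C, "A2C"), (PPO, "PPO")]:
    change = percent_change_after_updates(
        algo_cls,
        "checkpoints/cliff_policy.pt",             # hypothetical checkpoint
        "CartPole-v1",                             # placeholder environment
        learning_rate=3e-4, n_steps=2048)          # placeholder hyperparameters
    print(f"{name}: {change:+.1f}% change in mean return")
```

Averaging this percent change over many trials and over cliff versus non-cliff checkpoints, as the paper does, is what distinguishes PPO's tendency to stay near the cliff edge from A2C's tendency to dive off it.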