Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments

Authors: Ryan Sullivan, Jordan K Terry, Benjamin Black, John P Dickerson

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work presents reward surfaces and related visualizations of 27 of the most widely used reinforcement learning environments in Gym for the first time. We also explore reward surfaces in the policy gradient direction and show for the first time that many popular reinforcement learning environments have frequent cliffs (sudden large drops in expected return). We demonstrate that A2C often dives off these cliffs into low reward regions of the parameter space while PPO avoids them, confirming a popular intuition for PPO's improved performance over previous methods.
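The reward-surface idea behind these claims is straightforward to prototype. Below is a minimal toy sketch, assuming a hand-rolled linear policy on CartPole and a hypothetical evaluate_return helper; the paper's library instead perturbs trained agents' network weights along random directions and along the policy gradient direction, so nothing here should be read as its actual implementation.

```python
# Toy sketch (not the paper's code): perturb policy parameters along two random
# directions and measure mean episodic return at each grid point. A "cliff" shows
# up as a sudden large drop between neighbouring grid points.
import numpy as np
import gymnasium as gym

def evaluate_return(env, params, episodes=5):
    """Average episodic return of a linear policy: action = argmax(params @ obs)."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = int(np.argmax(params @ obs))
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / episodes

env = gym.make("CartPole-v1")
theta = np.zeros((env.action_space.n, env.observation_space.shape[0]))
d1, d2 = np.random.randn(*theta.shape), np.random.randn(*theta.shape)

# Sample returns over a small grid in the plane spanned by d1 and d2.
scales = np.linspace(-1.0, 1.0, 5)
surface = [[evaluate_return(env, theta + a * d1 + b * d2) for b in scales]
           for a in scales]
print(np.round(surface, 1))
```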
Researcher Affiliation | Collaboration | 1 Swarm Labs; 2 Department of Computer Science, University of Maryland, College Park.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We developed an extensive software library for plotting the reward surfaces of reinforcement learning agents to produce this work and encourage future research using these visualizations. ... The library is well organized and documented, and it can be found at https://github.com/RyanNavillus/reward-surfaces.
Open Datasets | Yes | We generated plots for all Classic Control and MuJoCo environments in Gym (Brockman et al., 2016) and for many popular Atari environments from the Arcade Learning Environment (Bellemare et al., 2013).
Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits as commonly defined for static datasets in supervised learning. Instead, it describes training based on environment steps and uses evaluation episodes for checkpoint selection and performance assessment.
Hardware Specification | No | The paper mentions that scripts are included for ... but it does not specify the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using ... but it does not provide a complete list of software dependencies with version numbers.
Experiment Setup | Yes | For these experiments we chose to plot the reward surface of PPO agents using the tuned hyperparameters found in RL Zoo3 (Raffin, 2020). ... As such, we try increasing both the learning rate (LR) and the number of steps per parallel environment per training update (N steps). ... Table 1. Table of A2C and PPO's average percent change in reward after taking a few gradient steps on cliff and non-cliff checkpoints for various sets of hyperparameters. These results are averaged among 10 trials each evaluated for 1000 episodes.
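The cliff test summarized in Table 1 could be approximated with off-the-shelf tooling roughly as follows. This is a hedged sketch, not the paper's code: the CartPole environment, the hyperparameter values, the episode count, and training fresh agents instead of loading saved cliff and non-cliff checkpoints are all illustrative assumptions.

```python
# Hedged sketch of the Table 1 protocol: measure the percent change in mean
# reward after a few gradient updates, comparing A2C and PPO.
import gymnasium as gym
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")

for algo, kwargs in [(A2C, {"learning_rate": 7e-4, "n_steps": 5}),
                     (PPO, {"learning_rate": 3e-4, "n_steps": 2048})]:
    # The paper would instead load a saved checkpoint known to sit near a cliff.
    model = algo("MlpPolicy", env, verbose=0, **kwargs)
    before, _ = evaluate_policy(model, env, n_eval_episodes=100)
    model.learn(total_timesteps=4 * kwargs["n_steps"])  # "a few" gradient steps
    after, _ = evaluate_policy(model, env, n_eval_episodes=100)
    print(f"{algo.__name__}: {100 * (after - before) / abs(before):+.1f}% change in reward")
```

Raising the learning rate or n_steps, as the quoted setup describes, makes each update larger and therefore more likely to step off a cliff, which is what the percent-change comparison is designed to expose.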