A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning

Authors: Arthur Juliani, Jordan Ash

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we perform an extensive set of experiments examining plasticity loss and a variety of mitigation methods in on-policy deep RL.
Researcher Affiliation | Industry | Arthur Juliani, Jordan T. Ash, Microsoft Research NYC, {ajuliani, ash.jordan}@microsoft.com
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper; methods are described in prose.
Open Source Code | Yes | Code which can be used to reproduce our results is available at https://github.com/awjuliani/deep-rl-plasticity.
Open Datasets | Yes | We use a simple gridworld task as our main sandbox for studying plasticity loss. The environment is drawn from the Neuro-Nav library (Juliani et al., 2022), and the goal of the agent is to collect rewarding blue jewels while avoiding punishing red jewels in a fixed time window (100 time-steps per episode). Each sample of the environment from the distribution of possible tasks changes both the location of the jewels and the walls of the maze. Figure 1 shows environment modification examples. We also use the CoinRun environment from the ProcGen suite of tasks to evaluate the set of candidate interventions (Cobbe et al., 2019, 2020). ProcGen is built using procedural generation (Cobbe et al., 2020), making it possible to study the same three distribution-shift conditions which were considered in the gridworld experiments. Montezuma's Revenge is an Atari game that is often used to benchmark the quality of exploration procedures in RL (Bellemare et al., 2013; Salimans and Chen, 2018). (A hedged environment-instantiation sketch appears below the table.)
Dataset Splits | No | The paper describes training over multiple 'rounds' with changing environment distributions and distinguishes between 'training performance' and 'test performance' in its figures, but it does not provide explicit percentages, sample counts, or a traditional train/validation/test split methodology, and no separate validation set is mentioned. (See the round-based protocol sketch below the table.)
Hardware Specification | Yes | All experiments are conducted using either a single P100 or V100 GPU on a cloud machine.
Software Dependencies | No | The paper mentions PPO, the Neuro-Nav library, and the Adam optimizer, but it does not provide version numbers for any software dependencies such as the programming language or deep learning framework.
Experiment Setup | Yes | Table 1: Optimal hyperparameter values used for different environments and interventions; chosen values are the result of a sweep over possible values using the permute environment-shift condition. Table 2: Default values used in the PPO algorithm for all gridworld and CoinRun experiments.
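
For reference, the following is a minimal sketch of how the public benchmark environments named in the Open Datasets row can be instantiated. It assumes the publicly documented `gym`, `procgen`, and Atari packages are installed; the Neuro-Nav gridworld construction is omitted because its exact API is not quoted here, and none of this code is taken from the paper's repository. The pool size below is an assumed placeholder, not a value from the paper.

```python
# Minimal sketch (not the paper's code): instantiating the public benchmark
# environments mentioned in the Open Datasets row.
import gym

# ProcGen CoinRun: procedurally generated, so the training distribution can be
# controlled via the level pool (start_level / num_levels / distribution_mode).
coinrun = gym.make(
    "procgen:procgen-coinrun-v0",
    start_level=0,
    num_levels=200,            # assumed pool size, not a value from the paper
    distribution_mode="easy",
)

# Montezuma's Revenge: the classic sparse-reward Atari exploration benchmark.
# (Newer gymnasium installs expose it as "ALE/MontezumaRevenge-v5" instead.)
montezuma = gym.make("MontezumaRevengeNoFrameskip-v4")

# Short random rollout to sanity-check the CoinRun environment (old gym step API).
obs = coinrun.reset()
for _ in range(10):
    obs, reward, done, info = coinrun.step(coinrun.action_space.sample())
    if done:
        obs = coinrun.reset()
```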
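
The Dataset Splits row notes that evaluation is organized around rounds of distribution shift rather than a conventional split. The sketch below illustrates one way such a round-based protocol can be expressed with ProcGen level pools; the number of rounds, the pool size, and the random-policy stand-in for PPO training are all assumptions for illustration, not details from the paper.

```python
# Illustrative round-based protocol (assumptions marked inline; not the paper's code).
import gym
import numpy as np

def mean_episode_return(env, episodes=3):
    """Stand-in for PPO training/evaluation: mean return of a random policy."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, info = env.step(env.action_space.sample())
            total += reward
        returns.append(total)
    return float(np.mean(returns))

NUM_ROUNDS = 3          # assumed number of distribution-shift rounds
LEVELS_PER_ROUND = 200  # assumed level-pool size per round

for round_idx in range(NUM_ROUNDS):
    # Each round shifts the training distribution by moving the level pool.
    train_env = gym.make(
        "procgen:procgen-coinrun-v0",
        start_level=round_idx * LEVELS_PER_ROUND,
        num_levels=LEVELS_PER_ROUND,
    )
    # Held-out evaluation levels: num_levels=0 samples from the unrestricted set.
    test_env = gym.make("procgen:procgen-coinrun-v0", start_level=0, num_levels=0)

    print(
        f"round {round_idx}: "
        f"train={mean_episode_return(train_env):.2f} "
        f"test={mean_episode_return(test_env):.2f}"
    )
```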
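
Finally, the Experiment Setup row states that intervention hyperparameters were chosen by sweeping candidate values under the permute environment-shift condition. The skeleton below shows the general shape of such a grid sweep; the hyperparameter names, candidate values, and the `run_permute_condition` placeholder are illustrative assumptions and do not reproduce Table 1 or Table 2.

```python
# Generic grid-sweep skeleton (illustrative only; values are NOT those in Table 1/2).
import itertools

# Hypothetical candidate grids for two intervention hyperparameters.
CANDIDATE_GRID = {
    "l2_weight": [1e-5, 1e-4, 1e-3],   # assumed candidates for a regularization strength
    "reset_interval": [5, 10, 20],     # assumed candidates for a periodic-reset schedule
}

def run_permute_condition(hparams: dict) -> float:
    """Placeholder: train PPO under the permute shift condition and return mean test reward."""
    raise NotImplementedError("plug in the actual training run here")

def grid_sweep(grid: dict) -> tuple[dict, float]:
    """Return the best hyperparameter combination found under the permute condition."""
    best_hparams, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        hparams = dict(zip(grid.keys(), values))
        score = run_permute_condition(hparams)
        if score > best_score:
            best_hparams, best_score = hparams, score
    return best_hparams, best_score
```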