A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning
Authors: Arthur Juliani, Jordan Ash
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we perform an extensive set of experiments examining plasticity loss and a variety of mitigation methods in on-policy deep RL. |
| Researcher Affiliation | Industry | Arthur Juliani and Jordan T. Ash, Microsoft Research NYC; {ajuliani, ash.jordan}@microsoft.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. Methods are described in prose. |
| Open Source Code | Yes | Code which can be used to reproduce our results is available at https://github.com/awjuliani/deep-rl-plasticity. |
| Open Datasets | Yes | We use a simple gridworld task as our main sandbox for studying plasticity loss. The environment is drawn from the Neuro-Nav library (Juliani et al., 2022), and the goal of the agent is to collect rewarding blue jewels while avoiding punishing red jewels in a fixed time window (100 time-steps per episode). Each sample of the environment from the distribution of possible tasks changes both the location of the jewels and the walls of the maze. Figure 1 shows environment modification examples. We also use the CoinRun environment from the ProcGen suite of tasks (Cobbe et al., 2019, 2020) to evaluate the set of candidate interventions. ProcGen is built using procedural generation (Cobbe et al., 2020), making it possible to study the same three distribution-shift conditions which were considered in the gridworld experiments. Montezuma's Revenge is an Atari game that is often used to benchmark the quality of exploration procedures in RL (Bellemare et al., 2013; Salimans and Chen, 2018). *(See the first sketch after this table for a hedged environment-setup example.)* |
| Dataset Splits | No | The paper describes training over multiple 'rounds' with a changing environment distribution and distinguishes between 'training performance' and 'test performance' in its figures. However, it does not provide percentages, sample counts, or a methodology for conventional train/validation/test splits, and no separate validation set is mentioned. *(A sketch of this round-based protocol appears after the table.)* |
| Hardware Specification | Yes | All experiments are conducted using either a single P100 or V100 GPU on a cloud machine. |
| Software Dependencies | No | The paper mentions using PPO, the Neuro-Nav library, and the Adam optimizer, but it does not provide version numbers for any software dependencies such as the programming language or deep learning framework. |
| Experiment Setup | Yes | Table 1: Optimal hyperparameter values used for different environments and interventions. Chosen values are the result of a sweep over possible values using the permute environment-shift condition. Table 2: Default values used in the PPO algorithm for all gridworld and CoinRun experiments. |
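
The paper evaluates plasticity-loss interventions on procedurally generated environments, including the CoinRun task from ProcGen. As a rough, non-authoritative illustration, the sketch below shows how a training level set and a held-out evaluation distribution can be instantiated through the public ProcGen gym interface; the level counts and `distribution_mode` are assumptions for illustration, not the settings used in the paper.

```python
# Minimal sketch, assuming the public ProcGen gym API (pip install procgen gym).
# Level counts and distribution_mode are illustrative, not the paper's values.
import gym

# Training distribution: a fixed set of procedurally generated levels.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=200,           # assumed size of the training level set
    start_level=0,
    distribution_mode="easy",
)

# Evaluation distribution: num_levels=0 samples from the full level space,
# the standard ProcGen protocol for measuring generalization.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())
```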
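
The 'Dataset Splits' row describes a round-based protocol rather than a conventional split: a single agent is trained for several rounds, the environment distribution is re-sampled at each round boundary, and training and test performance are logged separately. The sketch below summarizes that structure under stated assumptions; `make_distributions`, `train_ppo`, and `evaluate` are hypothetical placeholders rather than functions from the paper's released code, and the round and step counts are illustrative.

```python
# Hedged sketch of the round-based training protocol described above.
# make_distributions, train_ppo, and evaluate are hypothetical placeholders;
# NUM_ROUNDS and STEPS_PER_ROUND are illustrative, not the paper's values.

NUM_ROUNDS = 5
STEPS_PER_ROUND = 100_000


def run_rounds(agent, make_distributions, train_ppo, evaluate):
    """Train one agent across successive environment-distribution shifts."""
    history = []
    for round_idx in range(NUM_ROUNDS):
        # Re-sample the task distribution at each round boundary
        # (e.g. new jewel locations and maze walls in the gridworld).
        train_dist, test_dist = make_distributions(round_idx)

        # The same agent keeps training with PPO; plasticity loss shows up as
        # degrading training performance on freshly sampled distributions.
        train_ppo(agent, train_dist, steps=STEPS_PER_ROUND)

        history.append(
            {
                "round": round_idx,
                "train_return": evaluate(agent, train_dist),
                "test_return": evaluate(agent, test_dist),
            }
        )
    return history
```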