The Primacy Bias in Deep Reinforcement Learning
Authors: Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, Aaron Courville
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generally applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent (a minimal sketch of this reset mechanism is given after the table). We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance. |
| Researcher Affiliation | Academia | Mila, Université de Montréal. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | We use an open-source JAX implementation (Kostrikov, 2021) of the SAC and DrQ algorithms and an open-source JAX implementation of SPR (https://github.com/MaxASchwarzer/dopamine/tree/atari100k_spr). This SPR implementation exhibits slightly higher aggregate performance than the scores reported in Schwarzer et al. (2020), which are based on a PyTorch implementation. |
| Open Datasets | Yes | We focus on two domains: discrete control, represented by the 26-task Atari 100k benchmark (Kaiser et al., 2019), and continuous control, represented by the DeepMind Control Suite (Tassa et al., 2020). |
| Dataset Splits | No | The paper uses the Atari 100k benchmark and the DeepMind Control Suite but does not explicitly state train/validation/test dataset splits; it only describes evaluation over seeds and environment steps. |
| Hardware Specification | No | The paper acknowledges 'Compute Canada for computational resources' but does not provide specific hardware details such as the GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions using JAX, Jupyter, Matplotlib, numpy, pandas, and SciPy, and cites their original papers, but does not provide specific version numbers for these software dependencies, except implicitly for JAXRL by Kostrikov (2021). |
| Experiment Setup | Yes | We use default hyperparameters, which imply a single update for both policy and value function per step in the environment. For SPR, we reset only the final linear layer of the 5-layer Q-network, with resets spaced 2 x 10^4 steps apart over the course of training; for SAC, we reset the agent's networks entirely every 2 x 10^5 steps, since the networks have only 3 layers; for DrQ, we reset the last 3 out of 7 layers of the policy and value networks 10 times over the course of training. The paper also discusses 'replay ratio' and 'n-step targets', with values such as 9 updates per step, a replay ratio of 32, and n = 20 (see the training-loop sketch after the table). |
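
The periodic-reset mechanism referenced above can be illustrated with a minimal JAX sketch. This is not the authors' code: the layer sizes, initializer, and helper names (`init_params`, `maybe_reset`) are assumptions for illustration; only the general schedule (e.g. resets spaced 2 x 10^4 steps apart, resetting only the final layer as for SPR) follows the paper's description.

```python
# Minimal sketch (assumption, not the authors' code) of the periodic-reset
# mechanism: a small MLP stored as a list of (W, b) pairs, where the last
# N_RESET_LAYERS layers are re-drawn from the initializer on a fixed schedule
# while earlier layers keep their trained weights.
import jax
import jax.numpy as jnp

LAYER_SIZES = [64, 256, 256, 6]   # hypothetical sizes, not from the paper
RESET_INTERVAL = 2 * 10**4        # SPR resets are spaced 2 x 10^4 steps apart
N_RESET_LAYERS = 1                # SPR: only the final linear layer is reset

def init_layer(key, fan_in, fan_out):
    w_key, _ = jax.random.split(key)
    w = jax.random.normal(w_key, (fan_in, fan_out)) * jnp.sqrt(2.0 / fan_in)
    b = jnp.zeros(fan_out)
    return w, b

def init_params(key):
    keys = jax.random.split(key, len(LAYER_SIZES) - 1)
    return [init_layer(k, m, n)
            for k, m, n in zip(keys, LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def maybe_reset(params, step, key):
    """Re-initialize the last N_RESET_LAYERS layers on the reset schedule."""
    if step == 0 or step % RESET_INTERVAL != 0:
        return params
    fresh = init_params(key)                    # freshly initialized weights
    keep = len(params) - N_RESET_LAYERS
    return params[:keep] + fresh[keep:]         # keep early layers, reset late ones
```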
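The replay-ratio values mentioned in the experiment setup (e.g. 9 updates per environment step) can be placed in context with a hypothetical training-loop skeleton. The `agent`, `env`, and `buffer` interfaces below are assumptions rather than an API from the paper's codebase; the loop only illustrates how multiple gradient updates per environment step combine with the reset schedule sketched above.

```python
# Hypothetical training-loop skeleton (assumed agent/env/buffer interfaces,
# not the authors' code) showing the replay ratio: the number of gradient
# updates taken per environment step, combined with maybe_reset from above.
import jax

def train(agent, env, buffer, rng, total_steps, replay_ratio=9):
    obs = env.reset()
    for step in range(1, total_steps + 1):
        action = agent.act(obs)                        # assumed agent API
        next_obs, reward, done, _ = env.step(action)   # gym-style step
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

        # Replay ratio: several gradient updates for each environment step.
        for _ in range(replay_ratio):
            agent.update(buffer.sample())              # assumed update call

        # Periodic (partial) reset of the agent's parameters, as in the paper.
        rng, reset_key = jax.random.split(rng)
        agent.params = maybe_reset(agent.params, step, reset_key)
    return agent
```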