The Phenomenon of Policy Churn

Authors: Tom Schaul, André Barreto, John Quan, Georg Ostrovski

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs, the most likely one being deep learning with high-variance updates." "Figure 1: Average amount of policy change W (Eq. 2) per update, in two deep RL agents (Double DQN and R2D2). Points are averages over 3 seeds, on one of 15 (colour-coded) Atari games."
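The policy-change measure W referenced in the Figure 1 caption lends itself to a simple empirical estimate. The sketch below is a minimal illustration, assuming W is estimated as the fraction of states in a sampled batch whose greedy (argmax) action differs between two consecutive parameter vectors (e.g. the online network and a one-update-old target network); the paper's Eq. 2 is the authoritative definition, and the function name here is illustrative rather than taken from the authors' code.

```python
import numpy as np

def policy_churn(q_old: np.ndarray, q_new: np.ndarray) -> float:
    """Fraction of states whose greedy action changes between two Q-functions.

    q_old, q_new: arrays of shape [batch_size, num_actions] holding Q-values
    for the same batch of states under the older and newer parameters.
    This is a sketch of the per-update churn W; the paper's Eq. 2 is the
    authoritative definition.
    """
    greedy_old = np.argmax(q_old, axis=-1)  # greedy actions before the update
    greedy_new = np.argmax(q_new, axis=-1)  # greedy actions after the update
    return float(np.mean(greedy_old != greedy_new))

# Example: a batch of 5 states with 4 actions each; churn is the share of rows
# where the argmax moved after a noisy parameter "update".
rng = np.random.default_rng(0)
q_before = rng.normal(size=(5, 4))
q_after = q_before + 0.5 * rng.normal(size=(5, 4))
print(policy_churn(q_before, q_after))  # prints the fraction of changed greedy actions
```

For the two agents in Table 1 below, the paper reports this quantity averaging roughly 9% (Double DQN) and 6% (R2D2) per update.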
Researcher Affiliation | Industry | Tom Schaul, DeepMind, London, UK; André Barreto, DeepMind, London, UK; John Quan, DeepMind, London, UK; Georg Ostrovski, DeepMind, London, UK. {tom,andrebarreto,johnquan,ostrovski}@deepmind.com
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper.
Open Source Code | No | "Our implementation relies on some proprietary code."
Open Datasets | Yes | "in a typical deep RL set-up such as DQN on Atari"
Dataset Splits | No | "… arg max actions at the points in training where the target network lags behind by just one update. Figure 1 shows typical values for W on a few Atari games, estimated by comparing the policies induced by online and (one update old) target networks, on batches of experience sampled from the agent's replay buffer."
Hardware Specification | No | "We train all agents on a custom internal cluster."
Software Dependencies | No | Cited libraries (no versions given): JAX: composable transformations of Python+NumPy programs, 2018; Haiku: Sonnet for JAX, 2020; RLax: Reinforcement Learning in JAX, 2020; Reverb: An efficient data storage and transport system for ML research, 2020; Optax: Composable gradient transformation and optimisation, in JAX!, 2020; Chex: Testing made fun, in JAX!, 2020.
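The cited dependencies are the DeepMind JAX ecosystem libraries. As a purely illustrative sketch of how such a stack typically composes (the authors note their implementation relies on proprietary code, so this is not their setup), a Haiku Q-network can be trained with an Optax optimiser roughly as follows; the network size, learning rate, and plain TD-style regression loss are placeholder assumptions.

```python
import haiku as hk
import jax
import jax.numpy as jnp
import optax

NUM_ACTIONS = 18  # placeholder: the full Atari action set

def q_network(obs):
    # Small MLP stand-in; the paper's networks are convolutional/recurrent.
    return hk.nets.MLP([256, 256, NUM_ACTIONS])(obs)

network = hk.without_apply_rng(hk.transform(q_network))
optimiser = optax.adam(1e-4)  # placeholder learning rate

params = network.init(jax.random.PRNGKey(0), jnp.zeros((1, 128)))
opt_state = optimiser.init(params)

def loss_fn(params, obs, targets, actions):
    q = network.apply(params, obs)  # [batch, num_actions]
    q_taken = jnp.take_along_axis(q, actions[:, None], axis=-1)[:, 0]
    return jnp.mean((q_taken - targets) ** 2)  # simple regression to TD targets

@jax.jit
def update(params, opt_state, obs, targets, actions):
    grads = jax.grad(loss_fn)(params, obs, targets, actions)
    updates, opt_state = optimiser.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```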
Experiment Setup | Yes | Table 1: The two agent setups considered differ in a number of properties.

Property              Double DQN                        R2D2
Input                 84×84 grayscale                   210×160 RGB
Action set            minimal per game: 3 ≤ |A| ≤ 18    full: |A| = 18
Reward                clipped                           unclipped
Neural net            feed-forward, 1.7M parameters     recurrent, 5.5M parameters
Q-value head          regular                           dueling
Update                1-step double Q-learning          5-step double Q-learning
Optimiser             RMSProp without momentum          Adam with momentum = 0.9
Batch size            32                                32 × 80 = 2560
Replay, replay ratio  uniform, 8                        prioritised (exponent 0.9), 1
Parallel actors       1                                 192
Mean W per update     9%                                6%
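To make the reported settings easy to reuse, the two columns of Table 1 could be captured as configuration dictionaries. The sketch below is an illustrative encoding of the values above, not the authors' configuration format; the keys are placeholder names.

```python
# Illustrative encoding of Table 1; values mirror the reported settings.
AGENT_CONFIGS = {
    "double_dqn": {
        "input": "84x84 grayscale",
        "action_set": "minimal per game (3 <= |A| <= 18)",
        "reward": "clipped",
        "network": {"type": "feed-forward", "parameters": 1_700_000},
        "q_value_head": "regular",
        "update": "1-step double Q-learning",
        "optimiser": {"name": "RMSProp", "momentum": 0.0},
        "batch_size": 32,
        "replay": {"sampling": "uniform", "replay_ratio": 8},
        "parallel_actors": 1,
    },
    "r2d2": {
        "input": "210x160 RGB",
        "action_set": "full (|A| = 18)",
        "reward": "unclipped",
        "network": {"type": "recurrent", "parameters": 5_500_000},
        "q_value_head": "dueling",
        "update": "5-step double Q-learning",
        "optimiser": {"name": "Adam", "momentum": 0.9},
        "batch_size": 32 * 80,  # = 2560
        "replay": {"sampling": "prioritised", "priority_exponent": 0.9, "replay_ratio": 1},
        "parallel_actors": 192,
    },
}
```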