The Phenomenon of Policy Churn
Authors: Tom Schaul, André Barreto, John Quan, Georg Ostrovski
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs, the most likely one being deep learning with high-variance updates. Figure 1: Average amount of policy change W (Eq. 2) per update, in two deep RL agents (Double DQN and R2D2). Points are averages over 3 seeds, on one of 15 (colour-coded) Atari games. |
| Researcher Affiliation | Industry | Tom Schaul, André Barreto, John Quan, Georg Ostrovski (DeepMind, London, UK); {tom,andrebarreto,johnquan,ostrovski}@deepmind.com |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper. |
| Open Source Code | No | Our implementation relies on some proprietary code. |
| Open Datasets | Yes | in a typical deep RL set-up such as DQN on Atari |
| Dataset Splits | No | arg max actions at the points in training where the target network lags behind by just one update. Figure 1 shows typical values for W on a few Atari games, estimated by comparing the policies induced by online and (one update old) target networks, on batches of experience sampled from the agent's replay buffer. (See the sketch after this table for how such an estimate could be computed.) |
| Hardware Specification | No | We train all agents on a custom internal cluster. |
| Software Dependencies | No | JAX: composable transformations of Python+NumPy programs (2018); Haiku: Sonnet for JAX (2020); RLax: Reinforcement Learning in JAX (2020); Reverb: An efficient data storage and transport system for ML research (2020); Optax: Composable gradient transformation and optimisation, in JAX! (2020); Chex: Testing made fun, in JAX! (2020). |
| Experiment Setup | Yes | Table 1: The two agent setups considered (Double DQN vs. R2D2) differ in a number of properties. Input: 84×84 grayscale vs. 210×160 RGB; Action set: minimal per game (3 ≤ \|A\| ≤ 18) vs. full (\|A\| = 18); Reward: clipped vs. unclipped; Neural net: feed-forward (1.7M parameters) vs. recurrent (5.5M parameters); Q-value head: regular vs. dueling; Update: 1-step double Q-learning vs. 5-step double Q-learning; Optimiser: RMSProp without momentum vs. Adam with momentum 0.9; Batch size: 32 vs. 32×80 = 2560; Replay / replay ratio: uniform, 8 vs. prioritised (exponent 0.9), 1; Parallel actors: 1 vs. 192; Mean W per update: 9% vs. 6%. |
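
As a companion to the churn measure quoted above (W, Eq. 2), here is a minimal sketch of how the per-update policy change could be estimated, following the description in the table: compare the greedy (arg max) actions of two Q-networks (e.g. the online network and a one-update-old target network) on a batch of states sampled from the replay buffer, and report the fraction of states whose greedy action changed. The function and parameter names (`q_fn`, `params_new`, `params_old`, `states`) and the toy linear Q-function are hypothetical placeholders for illustration, not the authors' implementation.

```python
# Sketch: estimate the per-update policy change W as the fraction of states
# whose greedy action differs between two Q-networks. Hypothetical names,
# not the authors' code.
import jax
import jax.numpy as jnp


def policy_change_w(q_fn, params_new, params_old, states):
    """Fraction of states whose greedy action differs between two Q-nets.

    q_fn:   callable (params, states) -> Q-values of shape [batch, num_actions]
    states: batch of observations sampled from the replay buffer
    """
    greedy_new = jnp.argmax(q_fn(params_new, states), axis=-1)
    greedy_old = jnp.argmax(q_fn(params_old, states), axis=-1)
    return jnp.mean(greedy_new != greedy_old)


# Toy usage with a linear Q-function (illustrative only).
def toy_q_fn(params, states):
    # [batch, features] @ [features, num_actions] -> [batch, num_actions]
    return states @ params


key = jax.random.PRNGKey(0)
states = jax.random.normal(key, (256, 8))            # batch from "replay"
params_old = jax.random.normal(key, (8, 4))          # Q-net before the update
params_new = params_old + 0.1 * jax.random.normal(   # Q-net after one update
    jax.random.PRNGKey(1), (8, 4))
print(policy_change_w(toy_q_fn, params_new, params_old, states))
```

In practice one would average this quantity over many update steps and replay batches, as in the Figure 1 estimates quoted above (averages over 3 seeds per Atari game).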