Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
The Phenomenon of Policy Churn
Authors: Tom Schaul, Andre Barreto, John Quan, Georg Ostrovski
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs, the most likely one being deep learning with high-variance updates. Figure 1: Average amount of policy change W (Eq. 2) per update, in two deep RL agents (Double DQN and R2D2). Points are averages over 3 seeds, on one of 15 (colour-coded) Atari games. |
| Researcher Affiliation | Industry | Tom Schaul (DeepMind, London, UK); André Barreto (DeepMind, London, UK); John Quan (DeepMind, London, UK); Georg Ostrovski (DeepMind, London, UK) |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper. |
| Open Source Code | No | Our implementation relies on some proprietary code. |
| Open Datasets | Yes | in a typical deep RL set-up such as DQN on Atari |
| Dataset Splits | No | arg max actions at the points in training where the target network lags behind by just one update. Figure 1 shows typical values for W on a few Atari games, estimated by comparing the policies induced by online and (one update old) target networks, on batches of experience sampled from the agent's replay buffer. |
| Hardware Specification | No | We train all agents on a custom internal cluster. |
| Software Dependencies | No | JAX: composable transformations of Python+NumPy programs, 2018; Haiku: Sonnet for JAX, 2020; RLax: Reinforcement Learning in JAX, 2020; Reverb: An efficient data storage and transport system for ML research, 2020; Optax: Composable gradient transformation and optimisation, in JAX!, 2020; Chex: Testing made fun, in JAX!, 2020. |
| Experiment Setup | Yes | Table 1: The two agent setups considered differ in a number of properties (values given as Double DQN / R2D2). Input: 84 × 84 grayscale / 210 × 160 RGB. Action set: minimal per game, 3 ≤ \|A\| ≤ 18 / full, \|A\| = 18. Reward: clipped / unclipped. Neural net: feed-forward, 1.7M parameters / recurrent, 5.5M parameters. Q-value head: regular / dueling. Update: 1-step double Q-learning / 5-step double Q-learning. Optimiser: RMSProp without momentum / Adam with momentum = 0.9. Batch size: 32 / 32 × 80 = 2560. Replay, replay ratio: uniform, 8 / prioritised (exponent 0.9), 1. Parallel actors: 1 / 192. Mean W per update: 9% / 6%. |
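
The table quotes the paper as estimating the policy change W by comparing the greedy policies induced by the online network and a target network that is one update old, on batches sampled from the replay buffer. The sketch below illustrates one plausible reading of that estimate: the fraction of states in a batch whose arg max action differs between the two networks. It is not the authors' implementation (which relies on proprietary code), Eq. 2 is not reproduced on this page, and the function names and the toy Q-values are hypothetical.

```python
from typing import Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def policy_change(
    q_online: Sequence[Sequence[float]],
    q_target: Sequence[Sequence[float]],
) -> float:
    """Fraction of states whose greedy action differs between two networks.

    Assumption: each argument holds one row of Q-values per state in the
    batch, with the same states in the same order in both arguments.
    """
    if len(q_online) != len(q_target):
        raise ValueError("both batches must contain the same number of states")
    if not q_online:
        raise ValueError("batch must not be empty")
    changed = sum(
        greedy_action(row_online) != greedy_action(row_target)
        for row_online, row_target in zip(q_online, q_target)
    )
    return changed / len(q_online)


# Toy batch of 4 states and 3 actions (made-up numbers, not from the paper).
q_online = [[1.0, 0.5, 0.2], [0.1, 0.9, 0.3], [0.4, 0.4, 0.8], [0.7, 0.6, 0.1]]
q_target = [[1.0, 0.5, 0.2], [0.1, 0.2, 0.3], [0.4, 0.4, 0.8], [0.2, 0.6, 0.1]]

w = policy_change(q_online, q_target)
print(f"W = {w:.2f}")  # 2 of the 4 greedy actions differ, so W = 0.50
```

In this toy batch the greedy action changes in the second and fourth states only. In an actual agent the two Q-value batches would come from forward passes of the online and target networks on the same sampled states.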