Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Phenomenon of Policy Churn

Authors: Tom Schaul, André Barreto, John Quan, Georg Ostrovski

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs, the most likely one being deep learning with high-variance updates. Figure 1: Average amount of policy change W (Eq. 2) per update, in two deep RL agents (Double DQN and R2D2). Points are averages over 3 seeds, on one of 15 (colour-coded) Atari games."
Researcher Affiliation | Industry | "Tom Schaul, DeepMind, London, UK; André Barreto, DeepMind, London, UK; John Quan, DeepMind, London, UK; Georg Ostrovski, DeepMind, London, UK"
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper.
Open Source Code | No | "Our implementation relies on some proprietary code."
Open Datasets | Yes | "in a typical deep RL set-up such as DQN on Atari"
Dataset Splits | No | "arg max actions at the points in training where the target network lags behind by just one update. Figure 1 shows typical values for W on a few Atari games, estimated by comparing the policies induced by online and (one-update-old) target networks, on batches of experience sampled from the agent's replay buffer."
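The churn measure W quoted above is the fraction of sampled states on which the greedy action differs between the online and (one-update-old) target networks. A minimal NumPy sketch of that batch estimate (not the authors' code; the function name and toy Q-values are illustrative):

```python
import numpy as np

def policy_churn(q_online, q_target):
    """Fraction of states whose greedy (argmax) action differs between
    the online and target Q-networks -- a batch estimate of W."""
    a_online = np.argmax(q_online, axis=-1)
    a_target = np.argmax(q_target, axis=-1)
    return float(np.mean(a_online != a_target))

# Toy batch of 4 states with 3 actions each.
q_t = np.array([[1.0, 0.2, 0.1],
                [0.3, 0.9, 0.0],
                [0.5, 0.4, 0.6],
                [0.8, 0.1, 0.2]])
q_o = q_t.copy()
q_o[2] = [0.7, 0.4, 0.6]   # greedy action flips in state 2 only
print(policy_churn(q_o, q_t))  # 0.25: the policy changed on 1 of 4 states
```

In the paper's setting the batch would be drawn from the agent's replay buffer and the two Q-value arrays produced by consecutive network checkpoints.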
Hardware Specification | No | "We train all agents on a custom internal cluster."
Software Dependencies | No | "JAX: composable transformations of Python+NumPy programs, 2018; Haiku: Sonnet for JAX, 2020; RLax: Reinforcement Learning in JAX, 2020; Reverb: an efficient data storage and transport system for ML research, 2020; Optax: composable gradient transformation and optimisation, in JAX!, 2020; Chex: testing made fun, in JAX!, 2020."
Experiment Setup | Yes | Table 1: The two agent setups considered differ in a number of properties.

| Agent | Double DQN | R2D2 |
|---|---|---|
| Input | 84 × 84 grayscale | 210 × 160 RGB |
| Action set | minimal per game: 3 ≤ \|A\| ≤ 18 | full: \|A\| = 18 |
| Reward | clipped | unclipped |
| Neural net | feed-forward, 1.7M parameters | recurrent, 5.5M parameters |
| Q-value head | regular | dueling |
| Update | 1-step double Q-learning | 5-step double Q-learning |
| Optimiser | RMSProp without momentum | Adam with momentum = 0.9 |
| Batch size | 32 | 32 × 80 = 2560 |
| Replay, replay ratio | uniform, 8 | prioritised (exponent 0.9), 1 |
| Parallel actors | 1 | 192 |
| Mean W per update | 9% | 6% |
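Both setups in Table 1 use a double Q-learning update, in which the online network selects the next action and the target network evaluates it. A minimal NumPy sketch of the 1-step target (an illustrative reconstruction of the standard rule, not the authors' implementation; names and values are hypothetical):

```python
import numpy as np

def double_q_target(q_online_next, q_target_next, reward, discount):
    """1-step double Q-learning target: argmax from the online net,
    value estimate from the target net."""
    a_star = np.argmax(q_online_next, axis=-1)               # action selection
    q_eval = np.take_along_axis(                             # action evaluation
        q_target_next, a_star[:, None], axis=-1)[:, 0]
    return reward + discount * q_eval

# One transition: online net prefers action 1; target net scores it 1.5.
target = double_q_target(
    q_online_next=np.array([[1.0, 2.0]]),
    q_target_next=np.array([[0.5, 1.5]]),
    reward=np.array([1.0]),
    discount=0.99,
)
print(target)  # [2.485] = 1.0 + 0.99 * 1.5
```

The R2D2 setup differs mainly in using a 5-step return in place of the 1-step reward shown here.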