Why Target Networks Stabilise Temporal Difference Methods

Authors: Mattie Fellows, Matthew J. A. Smith, Shimon Whiteson

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition to our theoretical results, we experimentally evaluate our bounds on a toy domain, indicating that they are tight under relevant hyperparameter regimes. Taken together, our results lead to novel insight as to how exactly target networks affect optimisation, and when and why they are effective, leading to actionable results that can be used to further future research.
Researcher Affiliation | Academia | Department of Computer Science, University of Oxford, Oxford, United Kingdom.
Pseudocode | No | The paper does not contain explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | The paper does not include a statement or link indicating that the source code for the methodology is openly available.
Open Datasets | Yes | We evaluate the use of target networks with varying update frequencies on the well-known off-policy counterexample due to Baird (1995b). For the Cartpole experiment, we use a simple DQN-style setup with a small multilayer perceptron (MLP) representing the value function. A small adjustment is made from PFPE as characterised by the paper. Instead of updating value parameters on single data points, parameter updates are averaged across a small batch. This was found to increase stability of learning in both settings, with no notable effects when comparing across independent variables. This means that, in addition to our target network, we also make use of a replay buffer which stores observed transitions. As such, data used in updates was sampled uniformly from previous transitions. The policy was ϵ-greedy, with the estimated optimal action taken with probability 1 − ϵ. The environment is maintained by OpenAI as part of the gym suite, and falls under MIT licensing. (A hedged code sketch of this setup follows the table.)
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology). Data are instead generated online through continual interaction with the environments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using "simple DQN" and the "Open AI gym suite", but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The paper lists the following hyperparameters.

Parameter | Value
Environment Parameters
γ | 0.99
Architecture Parameters
MLP Hidden Layers | 2
Hidden Layer Size | 32
Nonlinearity | ReLU
ϵ | 0.05
Training Parameters
Total Target Network Updates | 500
Learning Rate | [0.001, 0.0005]
Momentum (µ) | [0, 0.01]
Batch Size | 500
Steps per Target Network Update (k) | 5
Data Gathering Steps per Update | 5
Replay Buffer Size | 2500
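
The hyperparameters above can be restated as a single configuration object. This is only a convenience sketch: the key names below are illustrative and do not come from the paper, while the values are those listed in the Experiment Setup table, with learning rate and momentum kept as the sets of values the paper reports.

# Hypothetical configuration dictionary; key names are illustrative,
# values are taken from the Experiment Setup table above.
config = {
    # Environment parameters
    "gamma": 0.99,                           # discount factor γ
    # Architecture parameters
    "mlp_hidden_layers": 2,
    "hidden_layer_size": 32,
    "nonlinearity": "ReLU",
    "epsilon": 0.05,                         # ϵ-greedy exploration
    # Training parameters
    "total_target_network_updates": 500,
    "learning_rate": [0.001, 0.0005],        # values reported by the paper
    "momentum": [0.0, 0.01],                 # values reported by the paper
    "batch_size": 500,
    "steps_per_target_network_update": 5,    # k
    "data_gathering_steps_per_update": 5,
    "replay_buffer_size": 2500,
}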
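
Since no source code is released, the following is a minimal sketch of the Cartpole setup described under Open Datasets: a small MLP value network, a target network copied every k update steps, a uniform replay buffer with updates averaged over a batch, and an ϵ-greedy behaviour policy. The use of gymnasium (a maintained successor to the OpenAI gym suite), PyTorch, and plain SGD with momentum, as well as the loop structure, are assumptions made for illustration rather than the authors' implementation; where the paper sweeps several values (learning rate, momentum), a single value is fixed here.

import random
from collections import deque

import gymnasium as gym  # assumed stand-in for the OpenAI gym suite
import numpy as np
import torch
import torch.nn as nn

GAMMA, EPSILON = 0.99, 0.05
LEARNING_RATE, MOMENTUM = 0.001, 0.0          # one value from each reported sweep
BATCH_SIZE, REPLAY_BUFFER_SIZE = 500, 2500
TOTAL_TARGET_UPDATES = 500
STEPS_PER_TARGET_UPDATE = 5                   # k
DATA_GATHERING_STEPS_PER_UPDATE = 5


def mlp(in_dim, out_dim, hidden=32, layers=2):
    # Two hidden layers of size 32 with ReLU nonlinearities, per the table above.
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)


env = gym.make("CartPole-v1")
q_net = mlp(env.observation_space.shape[0], env.action_space.n)
target_net = mlp(env.observation_space.shape[0], env.action_space.n)
target_net.load_state_dict(q_net.state_dict())        # target starts as a copy
opt = torch.optim.SGD(q_net.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)
buffer = deque(maxlen=REPLAY_BUFFER_SIZE)

obs, _ = env.reset()
for update in range(TOTAL_TARGET_UPDATES * STEPS_PER_TARGET_UPDATE):
    # Gather a few transitions with the ϵ-greedy behaviour policy.
    for _ in range(DATA_GATHERING_STEPS_PER_UPDATE):
        if random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, float(terminated)))
        obs = env.reset()[0] if (terminated or truncated) else next_obs

    if len(buffer) < BATCH_SIZE:
        continue
    # Sample uniformly from stored transitions and average the update over the batch.
    batch = random.sample(list(buffer), BATCH_SIZE)
    s, a, r, s2, done = (torch.as_tensor(np.array(col), dtype=torch.float32)
                         for col in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target computed with the frozen target network.
        bootstrap = r + GAMMA * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = ((q - bootstrap) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Copy the online parameters into the target network every k update steps.
    if (update + 1) % STEPS_PER_TARGET_UPDATE == 0:
        target_net.load_state_dict(q_net.state_dict())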