Why Target Networks Stabilise Temporal Difference Methods
Authors: Mattie Fellows, Matthew J. A. Smith, Shimon Whiteson
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to our theoretical results, we experimentally evaluate our bounds on a toy domain, indicating that they are tight under relevant hyperparameter regimes. Taken together, our results lead to novel insight as to how exactly target networks affect optimisation, and when and why they are effective, leading to actionable results that can be used to further future research. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Oxford, Oxford, United Kingdom. |
| Pseudocode | No | The paper does not contain explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | The paper does not include a statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | We evaluate the use of target networks with varying update frequencies on the well known off-policy counterexample due to Baird (1995b). For the Cartpole experiment, we use a simple DQN-style setup with a small multilayer perceptron (MLP) representing the value function. A small adjustment is made from PFPE as characterised by the paper. Instead of updating value parameters on single data points, parameter updates are averaged across a small batch. This was found to increase stability of learning in both settings, with no notable effects when comparing across independent variables. This means that, in addition to our target network, we also make use of a replay buffer which stores observed transitions. As such, data used in updates was sampled uniformly from previous transitions. The policy was ϵ-greedy, with the estimated optimal action taken with probability 1 − ϵ. The environment is maintained by OpenAI as part of the gym suite, and falls under MIT licensing. |
| Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology). It uses environments for continuous learning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using "simple DQN" and the "Open AI gym suite", but does not provide specific version numbers for these or other software dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Environment parameters: γ = 0.99. Architecture parameters: MLP with 2 hidden layers of size 32, ReLU nonlinearity, ϵ = 0.05. Training parameters: 500 total target network updates, learning rate [0.001, 0.0005], momentum (µ) [0, 0.01], batch size 500, steps per target network update (k) 5, data gathering steps per update 5, replay buffer size 2500. |
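
The setup quoted in the Open Datasets and Experiment Setup rows can be summarised in a short sketch. The snippet below is a minimal NumPy illustration, not the authors' code: linear Q features, synthetic transitions, the fixed seed, and the helper names (`q_values`, `epsilon_greedy`) are assumptions standing in for the paper's two-layer ReLU MLP and Gym CartPole environment, and the learning rate is one of the two listed values. The quoted hyperparameters are otherwise followed: γ = 0.99, ϵ = 0.05, batch size 500, 5 data-gathering steps per update, target update every k = 5 steps, replay buffer capacity 2500.

```python
# Minimal sketch of TD learning with a target network and replay buffer,
# following the hyperparameters in the table above (assumptions noted in text).
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, N_ACTIONS = 4, 2             # CartPole-like dimensions
GAMMA, EPS, LR = 0.99, 0.05, 1e-3     # discount, epsilon-greedy rate, learning rate
K, BATCH, BUFFER_SIZE = 5, 500, 2500  # target update period, batch size, buffer capacity


def q_values(theta, obs):
    """Linear Q estimate; an illustrative stand-in for the paper's 2-layer ReLU MLP."""
    return obs @ theta                # shape (batch, N_ACTIONS)


def epsilon_greedy(theta, obs):
    """Behaviour policy: greedy action with probability 1 - EPS."""
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(theta, obs[None])[0]))


theta = rng.normal(0.0, 0.1, (OBS_DIM, N_ACTIONS))  # online value parameters
theta_target = theta.copy()                         # frozen target-network copy
buffer = []                                         # replay buffer of (s, a, r, s')

obs = rng.normal(size=OBS_DIM)        # stand-in for env.reset()
for step in range(1, 2501):           # 500 target-network updates with k = 5
    # Gather a few transitions with the epsilon-greedy policy (synthetic
    # dynamics here; the real experiment steps a Gym CartPole environment).
    for _ in range(5):                # "data gathering steps per update"
        action = epsilon_greedy(theta, obs)
        next_obs, reward = rng.normal(size=OBS_DIM), rng.random()
        buffer.append((obs, action, reward, next_obs))
        obs = next_obs
    buffer = buffer[-BUFFER_SIZE:]    # cap the replay buffer at 2500 transitions

    # Sample a batch uniformly from previous transitions and average the update.
    idx = rng.integers(len(buffer), size=min(BATCH, len(buffer)))
    s, a, r, s2 = (np.array(x) for x in zip(*(buffer[i] for i in idx)))

    # Bootstrapped TD target is computed with the *frozen* target parameters.
    target = r + GAMMA * q_values(theta_target, s2).max(axis=1)
    pred = q_values(theta, s)[np.arange(len(idx)), a]
    td_error = target - pred

    # Batch-averaged semi-gradient step on the online parameters only.
    grad = np.zeros_like(theta)                   # shape (OBS_DIM, N_ACTIONS)
    np.add.at(grad.T, a, td_error[:, None] * s)   # accumulate per-action gradients
    theta += LR * grad / len(idx)

    # Periodic hard update of the target network every k steps.
    if step % K == 0:
        theta_target = theta.copy()
```

The ingredient the paper analyses is the frozen copy `theta_target`: the bootstrapped TD target is computed from it rather than from the online parameters, and it is only refreshed every k parameter updates, which is what distinguishes a target-network method from plain semi-gradient TD.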