Predictive auxiliary objectives in deep RL mimic learning in the brain
Authors: Ching Fang, Kim Stachenfeld
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that predictive objectives improve and stabilize learning, particularly in resource-limited architectures. We identify settings where longer predictive horizons better support representational transfer. Furthermore, we find that representational changes in this RL system bear a striking resemblance to changes in neural activity observed in the brain across various experiments. We evaluate the effects of predictive auxiliary objectives in a simple gridworld foraging task, and confirm that these objectives help prevent representational collapse, particularly in resource-limited networks. |
| Researcher Affiliation | Academia | Anonymous authors; paper under double-blind review. Because the paper is under double-blind review, no affiliation information is available. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements about releasing code or providing links to a source code repository. |
| Open Datasets | Yes | We find consistent results in a CIFAR version of this task (Fig A.2D), where models equipped with a predictive auxiliary objective outperform the other two models we tested (Fig A.3C). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We implement a double deep Q-learning network (Van Hasselt et al., 2016) with a predictive auxiliary objective, similar to François-Lavet et al. (2019) (Fig 1A). A deep convolutional neural network E encodes observation o_t at time t into a latent state z_t. The state z_t is used by two network heads: a Q-learning network Q(z, a) that is used to select action a_t, and a prediction network T(z, a) that predicts future latent states. Both Q and T are multi-layer perceptrons with one hidden layer. The weights of E, Q, and T are trained end-to-end to minimize the standard double Q-learning temporal difference loss L_Q (Van Hasselt et al., 2016) and a predictive auxiliary loss L_pred. The predictive auxiliary loss is similar to that of contrastive predictive coding (Oord et al., 2018). That is, L_pred = L_+ + L_-, where L_+ is a positive sampling loss and L_- is a negative sampling loss. The positive sample loss is defined as L_+ = ||τ(z_t, a_t) − z_{t+1} − γ τ(z_{t+1}, a_{t+1})||², where z_t = E(o_t) and τ(z_t, a_t) = z_t + T(z_t, a_t). That is, in the γ = 0 case, the network T learns the difference between the current and future latent states such that τ(z_t, a_t) = z_t + T(z_t, a_t) ≈ z_{t+1}. Additionally, γ modulates the predictive horizon. The negative sample loss is defined as L_- = exp(−||z_i − z_j||). The weights over the loss terms L_Q, L_+, L_- are chosen through a small grid search over the final episode score. In each step, the network is trained on one batch of replayed transitions (batch size 64). All error bars are standard error of the mean over 45 random seeds. |
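
To make the quoted setup concrete, below is a minimal PyTorch sketch of the encoder E, the Q-head Q(z, a), the prediction head T(z, a), and the CPC-style auxiliary loss L_pred = L_+ + L_-. This is an illustrative reconstruction under assumptions, not the authors' code (no code is released): the layer sizes, the `PredictiveDQN` class name, the one-hot action conditioning, and the within-batch shuffle used to form negative pairs are all assumptions.

```python
# Minimal sketch of a double-DQN encoder with a predictive auxiliary objective,
# following the description quoted above. Layer sizes, module names, and the
# negative-pair sampling are illustrative assumptions; the paper releases no code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictiveDQN(nn.Module):
    def __init__(self, n_channels: int, n_actions: int, latent_dim: int = 64, hidden: int = 128):
        super().__init__()
        # Encoder E: convolutional network mapping observation o_t -> latent state z_t.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        # Q-head Q(z, a): one-hidden-layer MLP over the latent state.
        self.q_head = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
        # Prediction head T(z, a): one-hidden-layer MLP conditioned on a one-hot action.
        self.t_head = nn.Sequential(
            nn.Linear(latent_dim + n_actions, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim)
        )
        self.n_actions = n_actions

    def encode(self, obs):
        return self.encoder(obs)

    def q_values(self, z):
        return self.q_head(z)

    def predict_next(self, z, action):
        # tau(z_t, a_t) = z_t + T(z_t, a_t): T predicts the latent-state difference.
        a_onehot = F.one_hot(action, self.n_actions).float()
        return z + self.t_head(torch.cat([z, a_onehot], dim=-1))


def predictive_aux_loss(model, obs_t, act_t, obs_tp1, act_tp1, gamma_pred: float = 0.5):
    """CPC-style auxiliary loss L_pred = L_+ + L_- as described in the quote above."""
    z_t, z_tp1 = model.encode(obs_t), model.encode(obs_tp1)
    tau_t = model.predict_next(z_t, act_t)
    tau_tp1 = model.predict_next(z_tp1, act_tp1)
    # Positive-sample loss: ||tau(z_t, a_t) - z_{t+1} - gamma * tau(z_{t+1}, a_{t+1})||^2.
    l_plus = (tau_t - z_tp1 - gamma_pred * tau_tp1).pow(2).sum(dim=-1).mean()
    # Negative-sample loss: exp(-||z_i - z_j||) between latents of different transitions;
    # pairing each z_t with a shuffled z_t from the same batch is an assumption here.
    z_neg = z_t[torch.randperm(z_t.shape[0])]
    l_minus = torch.exp(-(z_t - z_neg).norm(dim=-1)).mean()
    return l_plus + l_minus
```

In training, this auxiliary loss would be weighted and summed with the double-DQN temporal difference loss L_Q on each replayed batch of 64 transitions, with the loss weights chosen by the small grid search the paper describes.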