Predictive auxiliary objectives in deep RL mimic learning in the brain

Authors: Ching Fang, Kim Stachenfeld

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that predictive objectives improve and stabilize learning, particularly in resource-limited architectures. We identify settings where longer predictive horizons better support representational transfer. Furthermore, we find that representational changes in this RL system bear a striking resemblance to changes in neural activity observed in the brain across various experiments. We evaluate the effects of predictive auxiliary objectives in a simple gridworld foraging task, and confirm that these objectives help prevent representational collapse, particularly in resource-limited networks.
Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. The paper is under double-blind review, so no affiliation information is available.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing code or providing links to a source code repository.
Open Datasets | Yes | We find consistent results in a CIFAR version of this task (Fig A.2D), where models equipped with a predictive auxiliary objective outperform the other two models we tested (Fig A.3C).
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | We implement a double deep Q-learning network (Van Hasselt et al., 2016) with a predictive auxiliary objective, similar to François-Lavet et al. (2019) (Fig 1A). A deep convolutional neural network $E$ encodes observation $o_t$ at time $t$ into a latent state $z_t$. The state $z_t$ is used by two network heads: a Q-learning network $Q(z, a)$ that is used to select action $a_t$, and a prediction network $T(z, a)$ that predicts future latent states. Both $Q$ and $T$ are multi-layer perceptrons with one hidden layer. The weights of $E$, $Q$, and $T$ are trained end-to-end to minimize the standard double Q-learning temporal difference loss $L_Q$ (Van Hasselt et al., 2016) and a predictive auxiliary loss $L_{pred}$. The predictive auxiliary loss is similar to that of contrastive predictive coding (Oord et al., 2018). That is, $L_{pred} = L_+ + L_-$, where $L_+$ is a positive sampling loss and $L_-$ is a negative sampling loss. The positive sample loss is defined as $L_+ = \|\tau(z_t, a_t) - z_{t+1} - \gamma \tau(z_{t+1}, a_{t+1})\|^2$, where $z_t = E(o_t)$ and $\tau(z_t, a_t) = z_t + T(z_t, a_t)$. That is, in the $\gamma = 0$ case, the network $T$ learns the difference between current and future latent states, such that $\tau(z_t, a_t) = z_t + T(z_t, a_t) \approx z_{t+1}$. Additionally, $\gamma$ modulates the predictive horizon. The negative sample loss is defined as $L_- = \exp(-\|z_i - z_j\|)$. The weights over the loss terms $L_Q$, $L_+$, and $L_-$ are chosen through a small grid search over the final episode score. At each step, the network is trained on one batch of replayed transitions (batch size 64). All error bars are standard error of the mean over 45 random seeds.
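
Since the paper releases no code (see the Open Source Code row above), the following is a minimal PyTorch sketch of the predictive auxiliary loss as reconstructed from this description. The module layout, layer sizes, the stop-gradient on the bootstrapped target, and the in-batch negative sampling are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of L_pred = L_+ + L_- as described above; all names,
# sizes, and wiring here are assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Convolutional encoder E: observation o_t -> latent state z_t."""
    def __init__(self, in_channels=3, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),  # infers the flattened input size
        )

    def forward(self, obs):
        return self.net(obs)

class PredictionHead(nn.Module):
    """One-hidden-layer MLP T(z, a) predicting the change in latent state."""
    def __init__(self, latent_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z, a_onehot):
        return self.net(torch.cat([z, a_onehot], dim=-1))

def predictive_aux_loss(E, T, o_t, a_t, o_t1, a_t1, gamma=0.5):
    """L_pred = L_+ + L_-, with tau(z, a) = z + T(z, a).

    o_t / o_t1 are observation batches; a_t / a_t1 are one-hot action batches.
    L_+ pushes tau(z_t, a_t) toward z_{t+1} + gamma * tau(z_{t+1}, a_{t+1}),
    so gamma modulates the predictive horizon (gamma = 0 is one-step).
    L_- = exp(-||z_i - z_j||) pushes negative pairs apart, which is what
    counteracts representational collapse.
    """
    z_t, z_t1 = E(o_t), E(o_t1)
    tau_t = z_t + T(z_t, a_t)
    with torch.no_grad():  # assumption: no gradient through the TD-style target
        target = z_t1 + gamma * (z_t1 + T(z_t1, a_t1))
    L_pos = F.mse_loss(tau_t, target)
    # Negatives: pair each z_t with a shuffled z_{t+1} from the same batch
    # (a simplification; occasional self-pairings are ignored here).
    z_neg = z_t1[torch.randperm(z_t1.shape[0], device=z_t1.device)]
    L_neg = torch.exp(-torch.norm(z_t - z_neg, dim=-1)).mean()
    return L_pos + L_neg
```

In training, this term would be summed with the double DQN TD loss $L_Q$ using the grid-searched loss weights mentioned in the row above; the Q head (another one-hidden-layer MLP on $z_t$) is omitted here for brevity.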