Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments

Authors: Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Remi Munos, Michal Valko

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study a natural solution derived from structural causal models of the world: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome, which we use as additional input for predictions, such that intrinsic rewards only reflect the predictable aspects of world dynamics. First, we propose incorporating such hindsight representations into models to disentangle noise from novelty, yielding Curiosity in Hindsight: a simple and scalable generalization of curiosity that is robust to stochasticity. Second, we instantiate this framework for the recently introduced BYOL-Explore algorithm as our prime example, resulting in the noise-robust BYOL-Hindsight. Third, we illustrate its behavior under a variety of different stochasticities in a grid world, and find improvements over BYOL-Explore in hard-exploration Atari games with sticky actions. Notably, we show state-of-the-art results in exploring Montezuma's Revenge with sticky actions, while preserving performance in the non-sticky setting. (An illustrative sketch of this hindsight idea appears below.)
Researcher Affiliation | Industry | DeepMind. Correspondence: Dan Jarrett <jarrettd@google.com>.
Pseudocode | Yes | Algorithm 1: BYOL-Explore; Algorithm 2: BYOL-Hindsight
Open Source Code | No | The paper does not include any explicit statement about providing open-source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We use Atari with preprocessed grayscale 84×84-pixel images as input [77]. In all settings, we consider both intrinsic-only (no extrinsic signal) and mixed (extrinsic + intrinsic rewards) exploration regimes. We employ a Pycolab [76] maze (Figure 4): The agent spawns in the top right, and may explore past four (possibly stochastically oscillating) block elements (V1/2, H1/2), into the lower right where two coins are randomly spawned. [77] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013. [76] Thomas Stepleton. The pycolab game engine, 2017. URL https://github.com/deepmind/pycolab. (A sketch of the Atari preprocessing and sticky actions appears below.)
Dataset Splits | No | The paper describes experimental protocols and training parameters (e.g., batch size, sequence length, learner steps) but does not provide specific details on training/validation/test dataset splits for reproducibility, such as percentages or sample counts.
Hardware Specification | Yes | In terms of computation, 400 CPU actors generate data through an inference server, using four TPUv2 for evaluating the policy. Curiosity in Hindsight is agnostic to the underlying reinforcement learning algorithm used to optimize intrinsic rewards, so all RL implementation details in BYOL-Hindsight are identical to those in the original BYOL-Explore experiments.
Software Dependencies | No | The paper mentions various software components and algorithms used (e.g., VMPO, Deep ResNet, GRUs, the Adam optimizer) but does not provide specific version numbers for these software libraries or frameworks, which are necessary for reproducible dependency information.
Experiment Setup | Yes | PopArt-style [96] reward normalization is used with step size 0.01, and rewards are subsequently rescaled by 1 - γ with discount factor γ = 0.999. PopArt normalization is also applied to the output of the value network. To train the value function, V-trace is used without off-policy correction to define temporal-difference targets for a mean squared error loss with loss weight 0.5, and an entropy loss with loss weight 0.001 is added. The parameters η_init and α_init for VMPO are initialized to 0.5, with ϵ_η = 0.01 and ϵ_α = 0.005. The top-k parameter for VMPO is set to 0.5. For optimization, the Adam optimizer is used with learning rate 10^-4 and b1 = 0.9. Observation representations have size 512 and history representations size 256. ... The closed-loop and open-loop RNNs are simple GRUs [99], with actions provided to the RNN cells, embedded to representation size 32. ... The batch size is 32 and the sequence length is 128, and four TPUv2 are used in a distributed learning setup. The open-loop horizon is 1 for all Pycolab experiments, and 8 for all Atari experiments. The target network EMA is 0.99. Specifically for BYOL-Hindsight, the reconstructor network is an MLP with three hidden layers of 512, which is the same as the predictor network in BYOL-Explore above. The generator network and critic network are MLPs with three hidden layers of 512. The dimension of the generator noise ϵ is 256, and the dimension of the hindsight vector is 256. The temperature parameter is 0.5, except in Montezuma's Revenge where we show sensitivity to the temperature. The coefficient λ = 1 for model learning. For policy optimization, we empirically observe that the value of λ has little to no contribution towards the intrinsic reward (and little to no effect on exploration); for simplicity we set λ to zero for policy optimization. For the contrastive loss, negative samples are simply taken from the batch, so the contrastive set is also of batch size 32; the time dimension is not used as negatives. For optimization, the Adam optimizer is used with learning rate 10^-4 and b1 = 0.9 for both the reconstruction loss and the contrastive loss. (A simplified sketch of the reward normalization appears below.)
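
The following is a minimal sketch of the hindsight idea quoted in the Research Type row: the world model receives a learned summary of the realized outcome as an extra input, so the remaining prediction error, which serves as the intrinsic reward, reflects predictable structure rather than noise. This is not the authors' implementation; the module names, layer sizes, and the cosine prediction loss are illustrative assumptions (PyTorch is used for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HindsightWorldModel(nn.Module):
    """Sketch: predict the next latent from history plus a hindsight vector."""

    def __init__(self, history_dim=256, latent_dim=512, hindsight_dim=256, hidden=512):
        super().__init__()
        # Encodes the *realized* next latent (in hindsight) into a compact vector
        # intended to capture only the unpredictable part of the outcome.
        self.hindsight_encoder = nn.Sequential(
            nn.Linear(history_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hindsight_dim),
        )
        # Reconstructs the next latent from the history and the hindsight vector.
        self.reconstructor = nn.Sequential(
            nn.Linear(history_dim + hindsight_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def intrinsic_reward(self, history, next_latent):
        z = self.hindsight_encoder(torch.cat([history, next_latent], dim=-1))
        pred = self.reconstructor(torch.cat([history, z], dim=-1))
        # BYOL-style cosine prediction error: with hindsight available,
        # irreducible noise no longer inflates the error, so the remaining
        # loss (used as intrinsic reward) tracks learnable structure.
        return 1.0 - F.cosine_similarity(pred, next_latent.detach(), dim=-1)
```

In the paper, additional terms (a generator/critic and a contrastive loss) constrain the hindsight vector so it cannot simply copy the entire outcome; the sketch omits them.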
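
The Open Datasets row quotes the 84×84 grayscale Atari inputs, and the abstract refers to the sticky-actions evaluation. The sketch below shows one way to reproduce that setting; the function names, the repeat probability of 0.25 (the common ALE convention, not stated in the quoted text), and the Gymnasium-style five-value step return are assumptions.

```python
import cv2
import numpy as np

def preprocess(frame_rgb: np.ndarray) -> np.ndarray:
    """Grayscale + resize to 84x84, matching the quoted Atari input format."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def sticky_step(env, action, prev_action, rng, repeat_prob=0.25):
    """Sticky actions: with probability repeat_prob, the previous action is
    executed instead of the agent's chosen action."""
    if prev_action is not None and rng.random() < repeat_prob:
        action = prev_action
    # Assumes a Gymnasium-style environment returning a five-value step tuple.
    obs, reward, terminated, truncated, info = env.step(action)
    return preprocess(obs), reward, terminated, truncated, info, action

# Usage: rng = np.random.default_rng(0); track the action actually executed
# and pass it back in as prev_action on the next call.
```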
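
The Experiment Setup row describes PopArt-style reward normalization with step size 0.01 followed by rescaling by 1 - γ with γ = 0.999. Below is a heavily simplified sketch of such a running-statistics normalizer: it tracks exponential moving first and second moments and divides by the resulting scale. Full PopArt also preserves the value network's outputs by rescaling its final layer, which is omitted here, and the scale-only treatment of rewards (no centering) is an assumption of this sketch.

```python
import math

class RewardNormalizer:
    """Simplified PopArt-style scaling of intrinsic rewards (sketch only)."""

    def __init__(self, step_size=0.01, gamma=0.999, eps=1e-6):
        self.mu = 0.0   # running first moment of the reward
        self.nu = 1.0   # running second moment of the reward
        self.step_size, self.gamma, self.eps = step_size, gamma, eps

    def __call__(self, reward: float) -> float:
        # Exponential moving averages with the quoted step size of 0.01.
        self.mu += self.step_size * (reward - self.mu)
        self.nu += self.step_size * (reward * reward - self.nu)
        sigma = math.sqrt(max(self.nu - self.mu ** 2, self.eps))
        # Divide by the running scale, then rescale by (1 - gamma) as quoted.
        return (1.0 - self.gamma) * reward / sigma
```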