Curious Replay for Model-based Adaptation

Authors: Isaac Kauvar, Chris Doyle, Linqi Zhou, Nick Haber

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Agents using Curious Replay exhibit improved performance in an exploration paradigm inspired by animal behavior and on the Crafter benchmark. DreamerV3 with Curious Replay surpasses state-of-the-art performance on Crafter, achieving a mean score of 19.4 that substantially improves on the previous high score of 14.5 by DreamerV3 with uniform replay, while also maintaining similar performance on the DeepMind Control Suite.
Researcher Affiliation | Academia | *Equal contribution. 1 Graduate School of Education, Stanford University, Stanford, CA, USA. 2 Department of Computer Science, Stanford University, Stanford, CA, USA. Correspondence to: Isaac Kauvar <ikauvar@stanford.edu>, Nick Haber <nhaber@stanford.edu>.
Pseudocode | Yes | Algorithm 1: Curious Replay
Input: Replay buffer R that uses a SumTree structure to store the priority p_i of each transition
Hyperparameters: c, β, α, ε, environment steps per train step L, batch size B, maximum priority p_MAX
for iteration 1, 2, ... do
    Collect L transitions (x_t, a_t, r_t, x_{t+1}) with policy
    Add transitions to replay buffer R, each with priority p_i ← p_MAX and visit count v_i ← 0
    Sample a batch of B transitions from R, selecting transition i with probability p_i / Σ_{j=1}^{|R|} p_j
    Train world model and policy using the batch, and cache the loss L_i for each transition in the batch
    for transition i in batch do
        p_i ← c · β^{v_i} + (|L_i| + ε)^α   (see Equation 1)
        v_i ← v_i + 1
    end for
end for
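To make the priority rule concrete, here is a minimal Python sketch of the count-based plus loss-based prioritization in Algorithm 1, assuming a simple array-backed buffer rather than the SumTree structure used in the paper; the class and method names are illustrative, not from the code release.

```python
import numpy as np

# Hyperparameter values follow the defaults reported in the paper.
C, BETA, ALPHA, EPS, P_MAX = 1e4, 0.7, 0.7, 0.01, 1e5

class CuriousReplayBuffer:
    def __init__(self):
        self.transitions = []   # stored (x_t, a_t, r_t, x_{t+1}) tuples
        self.priorities = []    # p_i for each transition
        self.visit_counts = []  # v_i, number of times transition i has been trained on

    def add(self, transition):
        # New transitions get the maximum priority so they are sampled soon.
        self.transitions.append(transition)
        self.priorities.append(P_MAX)
        self.visit_counts.append(0)

    def sample(self, batch_size):
        # Probability of selecting transition i is p_i / sum_j p_j.
        p = np.asarray(self.priorities, dtype=np.float64)
        probs = p / p.sum()
        idx = np.random.choice(len(p), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, losses):
        # Count-based term decays with visits; loss-based term keeps surprising
        # (poorly modeled) transitions in rotation.
        for i, loss in zip(idx, losses):
            v = self.visit_counts[i]
            self.priorities[i] = C * BETA ** v + (abs(loss) + EPS) ** ALPHA
            self.visit_counts[i] = v + 1
```

Newly collected transitions start at p_MAX, so they are trained on shortly after collection; with each training pass the count-based term shrinks via β^{v_i}, leaving the world-model loss term to determine how often the transition is replayed.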
Open Source Code | Yes | Code for Curious Replay is available at github.com/AutonomousAgentsLab/curiousreplay.
Open Datasets | Yes | We investigate adaptation of model-based agents in three settings: an intrinsically-motivated object interaction assay, variants of the DeepMind Control Suite, and Crafter (Hafner, 2021). To investigate agent adaptation in an extrinsically-rewarded, changing environment, we modify the DeepMind Control Suite (Tassa et al., 2018) to make it a changing environment.
Dataset Splits | No | The paper describes how models are trained on data from the replay buffer and evaluated on test episodes, but it does not specify explicit numerical splits (e.g., percentages or counts) for training, validation, and test datasets, a convention that is common in supervised learning but less directly applicable to RL with interactive data collection.
Hardware Specification | Yes | Model training and evaluation used Google Cloud T4 GPU instances.
Software Dependencies | Yes | Dreamer is implemented in TensorFlow 2. We use the STArr Python package as a fast SumTree implementation for DreamerV2 and DreamerPro. For Curious Replay, a running minimum (across the entire run) was subtracted from the loss before a priority value was computed. We did not use the running minimum when prioritizing with temporal difference. We customize the MuJoCo-based dm_control library for the object interaction assay implementation, with a control timestep of 0.03 s and a physics simulation timestep of 0.005 s. For the Background-Swap Control Suite, a ground-plane alpha of 0.1 is used. Code will be made publicly available. In DreamerV3, sequences of length 64 are stored in the replay buffer. In the Curious Replay implementation for DreamerV3, the probability of training on a sequence is based on the priority calculated for the last step of the sequence. The Reverb replay buffer (Cassirer et al., 2021) is used to store the sequences and priorities and to select the samples. After each training step, the training count and priority for each step in the sequence are updated. Unlike in the DreamerV2 implementation, the loss used to calculate the priority is not adjusted by the running minimum.
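The sequence-level prioritization described for the DreamerV3 implementation can be sketched as follows. This is an illustrative reading of the description above, not code from the release; the function and variable names are assumptions.

```python
# Hedged sketch of DreamerV3-style sequence prioritization: the sampling priority
# of a stored length-64 sequence is computed from its last step, and the training
# count for every step in the sequence is incremented after each training step.
def sequence_priority(step_losses, step_visit_counts,
                      c=1e4, beta=0.7, alpha=0.7, eps=0.01):
    last_loss = step_losses[-1]            # priority uses the sequence's last step
    last_visits = step_visit_counts[-1]
    # Unlike the DreamerV2 implementation, no running-minimum adjustment of the loss.
    return c * beta ** last_visits + (abs(last_loss) + eps) ** alpha

def after_train_step(step_losses, step_visit_counts):
    # Recompute the sequence priority and update the per-step training counts.
    priority = sequence_priority(step_losses, step_visit_counts)
    new_counts = [v + 1 for v in step_visit_counts]
    return priority, new_counts
```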
Experiment Setup | Yes | Hyperparameters: Agents used defaults for DreamerV2, DreamerV3, Plan2Explore, and DreamerPro. For Curious Replay, β = 0.7, α = 0.7, c = 1e4, ε = 0.01, and p_MAX = 1e5. These were optimized on the object interaction assay and fixed across all environments. For object interaction and the Control Suite, batches are B = 10 sequences of fixed length L = 50. For continuous control tasks, we used a nondiscrete latent space, which we found performed better. For Crafter, we used a slightly more updated version of DreamerV2 that includes layer normalization, B = 16, and a discrete latent space. Model training and evaluation used Google Cloud T4 GPU instances. Each agent's online return is logged every episode for the Control Suite, and every 20 steps for object interaction. Episode length is 1K steps for the Control Suite and 100K steps for object interaction to allow for substantial uninterrupted exploration. In Crafter, episode length depends on the agent's survival. An action repeat of 2 (Hafner et al., 2019a) is applied across the object interaction and Control Suite environments, with no action repeat for Crafter.
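For reference, the reported hyperparameters and per-environment settings can be gathered into a single configuration sketch. The dictionary keys are illustrative names; the values are those stated in the setup description above.

```python
# Curious Replay hyperparameters reported in the experiment setup.
CURIOUS_REPLAY_HPARAMS = {
    "beta": 0.7,     # decay rate of the count-based term
    "alpha": 0.7,    # exponent on the (|loss| + eps) term
    "c": 1e4,        # scale of the count-based term
    "eps": 0.01,     # small constant added to the loss magnitude
    "p_max": 1e5,    # initial (maximum) priority for new transitions
}

# Per-environment training settings as described above.
ENV_SETUP = {
    "object_interaction": {"batch_size": 10, "seq_len": 50, "action_repeat": 2,
                           "episode_length": 100_000},
    "control_suite":      {"batch_size": 10, "seq_len": 50, "action_repeat": 2,
                           "episode_length": 1_000},
    "crafter":            {"batch_size": 16, "action_repeat": 1},  # episode length depends on survival
}
```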