Policy-shaped prediction: avoiding distractions in model-based reinforcement learning

Authors: Miles Hutson, Isaac Kauvar, Nick Haber

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate the model's performance we design our experiments around the following questions: Q1. Is our agent robust against distractors which are learnable by the world model, but of no utility for the actor-critic? ... 3.1 Experimental details. Baselines: We test four Model-Based RL approaches as baselines: Dreamer V3 [Hafner et al., 2023], and three methods specifically designed to handle distractions: Task Informed Abstractions [Fu et al., 2021], Denoised MDP (method in their Figure 2b) [Wang et al., 2022], and DreamerPro [Deng et al., 2022]. ... Table 1: Performance comparison across environments."
Researcher Affiliation | Academia | Miles Hutson, Stanford University, hutson@stanford.edu; Isaac Kauvar, Stanford University, ikauvar@stanford.edu; Nick Haber, Stanford University, nhaber@stanford.edu
Pseudocode | Yes | "Algorithm 1: Policy-Shaped Prediction training (for Dreamer V3)"
Open Source Code | Yes | "The repository with code and instructions for reproducing these experiments is available at this GitHub Repository. ... An anonymized version of the code will be available at the linked GitHub Repository for reviewers."
Open Datasets | Yes | "We test performance in three environments: DeepMind Control Suite (DMC) [Tassa et al., 2018], Reafferent DMC (described below), and Distracting Control Suite [Stone et al., 2021] (with background video initialized to a random frame each episode, 2,000 grayscale frames from the 'driving car' Kinetics dataset [Kay et al., 2017])."
Dataset Splits | No | The paper uses established environments such as the DeepMind Control Suite and the Distracting Control Suite, but it does not explicitly detail the training, validation, and test splits used for its experiments, nor does it reference predefined splits with specific percentages or counts.
Hardware Specification | Yes | "Each trial of the PSP method used 4 NVIDIA A40 GPUs to train the modified Dreamer V3 model, and 4 A40 GPUs to run the Segment Anything model in parallel. ... Baseline trials could be run on only a single A40 GPU or a desktop NVIDIA 2070 SUPER, usually in less than a day, and accounted for a comparably negligible level of resources."
Software Dependencies | No | The paper mentions software components such as Python, JAX, the Segment Anything Model (SAM), and Dreamer V3, but it does not provide specific version numbers for these, or any other, key software libraries or dependencies.
Experiment Setup | Yes | "When updating θ during world model training, we subtract the scaled gradient ϵ∇_θ L(â_{t−1}, a_{t−1}) from the overall world model gradient, with ϵ = 1e3. ... As a regularizer, we linearly interpolate between the salience weighting and a uniform weighting, with α = 0.9 for all our experiments. ... To ignore any exploding gradients, we clip the raw salience map to the 99th percentile before aggregation. ... For all agents, we use 3 random seeds per task, and default hyperparameters."
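
To make the quoted Experiment Setup details concrete, the following is a minimal sketch in Python/JAX (matching the paper's reported stack) of the two mechanisms mentioned there: the 99th-percentile clipping and α = 0.9 interpolation of the salience weighting, and the subtraction of the scaled action-prediction gradient from the world-model gradient with ϵ = 1e3. The function and variable names (salience_weights, adjusted_world_model_grad, wm_grad, action_pred_grad) are illustrative assumptions, not the authors' identifiers, and the normalization step is one plausible choice rather than something stated in the quoted text.

    import jax
    import jax.numpy as jnp

    def salience_weights(raw_salience, alpha=0.9):
        # Clip the raw salience map at its 99th percentile to ignore exploding gradients.
        cap = jnp.percentile(raw_salience, 99.0)
        clipped = jnp.minimum(raw_salience, cap)
        # Normalize to a per-pixel weighting (the normalization choice is an assumption).
        salience = clipped / (jnp.sum(clipped) + 1e-8)
        # Uniform weighting over the same map.
        uniform = jnp.ones_like(salience) / salience.size
        # Regularize: linearly interpolate between the salience and uniform weightings.
        return alpha * salience + (1.0 - alpha) * uniform

    def adjusted_world_model_grad(wm_grad, action_pred_grad, eps=1e3):
        # Subtract the scaled gradient eps * dL(a_hat_{t-1}, a_{t-1})/d(theta)
        # from the overall world-model gradient, with eps = 1e3 as quoted.
        return jax.tree_util.tree_map(lambda g, a: g - eps * a, wm_grad, action_pred_grad)

A plausible (but inferred, not quoted) use is to multiply a per-pixel prediction loss by the returned weights and to pass the adjusted gradient to the optimizer in place of the raw world-model gradient.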