Policy-shaped prediction: avoiding distractions in model-based reinforcement learning
Authors: Miles Hutson, Isaac Kauvar, Nick Haber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the model's performance we design our experiments around the following questions: Q1. Is our agent robust against distractors which are learnable by the world model, but of no utility for the actor-critic? ... 3.1 Experimental details. Baselines: We test four Model-Based RL approaches as baselines: DreamerV3 [Hafner et al., 2023], and three methods specifically designed to handle distractions: Task Informed Abstractions [Fu et al., 2021], Denoised MDP (method in their Figure 2b) [Wang et al., 2022], and DreamerPro [Deng et al., 2022]. ... Table 1: Performance comparison across environments. |
| Researcher Affiliation | Academia | Miles Hutson, Stanford University, hutson@stanford.edu; Isaac Kauvar, Stanford University, ikauvar@stanford.edu; Nick Haber, Stanford University, nhaber@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Policy-Shaped Prediction training (for DreamerV3) |
| Open Source Code | Yes | The repository with code and instructions for reproducing these experiments is available at this GitHub Repository. ... An anonymized version of the code will be available at the linked GitHub Repository for reviewers. |
| Open Datasets | Yes | We test performance in three environments: DeepMind Control Suite (DMC) [Tassa et al., 2018], Reafferent DMC (described below), and Distracting Control Suite [Stone et al., 2021] (with background video initialized to a random frame each episode, 2,000 grayscale frames from the 'driving car' Kinetics dataset [Kay et al., 2017]). |
| Dataset Splits | No | The paper uses established environments like the DeepMind Control Suite and Distracting Control Suite but does not explicitly detail the training, validation, and test dataset splits used for its experiments, nor does it reference predefined splits with specific percentages or counts. |
| Hardware Specification | Yes | Each trial of the PSP method used 4 Nvidia A40 GPUs to train the modified Dreamer V3 model, and 4 A40 GPUs to run the Segment Anything model in parallel. ... Baseline trials could be run on only a single A40 GPU or a desktop NVIDIA 2070 SUPER, usually in less than a day, and accounted for a comparably negligible level of resources. |
| Software Dependencies | No | The paper mentions software components like Python, JAX, the Segment Anything Model (SAM), and DreamerV3, but does not provide specific version numbers for these or any other key software libraries or dependencies. |
| Experiment Setup | Yes | When updating θ during world model training, we subtract the scaled gradient $\epsilon \nabla_\theta L(\hat{a}_{t-1}, a_{t-1})$ from the overall world model gradient, with ϵ = 1e3. ... As a regularizer, we linearly interpolate between the salience weighting and a uniform weighting, with α = 0.9 for all our experiments... To ignore any exploding gradients, we clip the raw salience map to the 99th percentile before aggregation. ... For all agents, we use 3 random seeds per task, and default hyperparameters. |
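
To make the weighting scheme quoted in the Experiment Setup row more concrete, below is a minimal sketch of how a salience-weighted world-model loss with 99th-percentile clipping and uniform interpolation (α = 0.9) might be computed. This is not the authors' implementation: the function name `salience_weighted_loss` and its arguments are illustrative assumptions, and it presumes a per-pixel reconstruction loss and a precomputed raw salience map are already available.

```python
import jax.numpy as jnp

def salience_weighted_loss(per_pixel_loss, salience_map, alpha=0.9, clip_pct=99.0):
    """Hypothetical sketch of the weighting scheme quoted above.

    per_pixel_loss: (H, W) world-model reconstruction loss per pixel
    salience_map:   (H, W) raw policy-derived salience (assumed precomputed)
    """
    # Guard against exploding gradients: clip the raw salience map
    # to its 99th percentile before aggregation.
    clipped = jnp.minimum(salience_map, jnp.percentile(salience_map, clip_pct))

    # Normalize the clipped salience into a distribution over pixels.
    salience_weights = clipped / (clipped.sum() + 1e-8)
    uniform_weights = jnp.full_like(salience_weights, 1.0 / salience_weights.size)

    # Regularize: linearly interpolate between salience and uniform weighting.
    weights = alpha * salience_weights + (1.0 - alpha) * uniform_weights

    # Weighted aggregation of the per-pixel world-model loss.
    return (weights * per_pixel_loss).sum()
```

In this reading, the percentile clip plays the role of the quoted safeguard against exploding gradients, while the α-interpolation with a uniform weighting keeps any pixel's loss contribution from being driven entirely to zero.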