Improving Intrinsic Exploration by Creating Stationary Objectives
Authors: Roger Creus Castanyer, Joshua Romoff, Glen Berseth
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS: SOFE is designed to improve performance on exploration tasks. To evaluate its efficacy, we study three questions: (1) How much does SOFE facilitate the optimization of non-stationary exploration bonuses? (2) Does this increased stationarity improve exploration for downstream tasks? (3) How well does SOFE scale to image-based state inputs where approximations are needed to estimate state-visitation frequencies? To answer each of these research questions, we run the experiments as follows. (See the count-bonus sketch after this table.) |
| Researcher Affiliation | Collaboration | Roger Creus Castanyer, Mila - Québec AI Institute, Université de Montréal; Joshua Romoff, Ubisoft La Forge, joshua.romoff@ubisoft.com; Glen Berseth, Mila - Québec AI Institute, Université de Montréal, {roger.creus-castanyer, glen.berseth}@mila.quebec |
| Pseudocode | Yes | A.7 STATE-ENTROPY MAXIMIZATION: In this section, we provide the pseudo-code for the surprise-maximization algorithm presented in Section 3.1.3. ... Algorithm 1 Surprise Maximization (a hedged re-implementation sketch follows this table) |
| Open Source Code | No | Videos of the trained agents and summarized findings can be found on our supplementary webpage. |
| Open Datasets | Yes | DeepSea sparse-reward hard-exploration task from the DeepMind bsuite (Osband et al., 2019); MiniHack-MultiRoom-N6-v0 task, originally used for E3B in Henaff et al. (2023); Procgen-Maze task (Cobbe et al., 2020); Habitat environment (Szot et al., 2021); HM3D dataset (Ramakrishnan et al., 2021) |
| Dataset Splits | No | No explicit statement of training, validation, and test dataset splits with percentages or counts was found. The paper focuses on experimental setups within reinforcement learning environments. |
| Hardware Specification | No | We optimize the E3B exploration bonus with PPO (Schulman et al., 2017), which requires 31 hours on a machine with a single GPU. |
| Software Dependencies | No | We use Stable-Baselines3 (Raffin et al., 2021) to run our experiments in the mazes, Godot maps, and DeepSea. |
| Experiment Setup | Yes | A.3 TRAINING DETAILS; Table 2: Hyperparameters for the DQN Implementation; Table 3: Hyperparameters for the PPO Implementation; Table 4: Hyperparameters for the A2C Implementation (an illustrative Stable-Baselines3 configuration follows this table) |
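
The Research Type row quotes the paper's three experimental questions, all of which hinge on the non-stationarity of exploration bonuses: a count-based bonus shrinks each time a state is revisited, so the reward function the agent optimizes keeps changing under its feet. As we understand it, SOFE's remedy is to fold the bonus's parameters (here, the visitation counts) into the observation, so that the bonus is a fixed function of the augmented state. The following is a minimal tabular sketch of that idea, assuming an old-style Gym `reset`/`step` API; `CountBonusEnvWrapper` and all of its details are our own illustration, not the authors' code.

```python
import numpy as np

class CountBonusEnvWrapper:
    """Illustrative wrapper: adds a count-based bonus r_int = 1/sqrt(N(s)) and,
    SOFE-style, appends the visit counts to the observation so the otherwise
    non-stationary bonus becomes a fixed function of the augmented state."""

    def __init__(self, env, n_states, augment=True):
        self.env = env
        self.counts = np.zeros(n_states)  # N(s): the bonus parameters
        self.augment = augment

    def _obs(self, s):
        one_hot = np.eye(len(self.counts))[s]
        if self.augment:
            # SOFE-style augmentation: normalized counts become part of the
            # state, restoring stationarity of the intrinsic objective.
            return np.concatenate([one_hot, self.counts / (1.0 + self.counts.sum())])
        return one_hot

    def reset(self):
        s = self.env.reset()
        return self._obs(s)

    def step(self, action):
        s, r_ext, done, info = self.env.step(action)
        self.counts[s] += 1
        r_int = 1.0 / np.sqrt(self.counts[s])  # decays as s is revisited
        return self._obs(s), r_ext + r_int, done, info
```

Without `augment=True`, two visits to the same underlying state can yield different rewards, which is exactly the non-stationarity the paper's first research question targets.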
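The Pseudocode row points to Algorithm 1 (Surprise Maximization) in Appendix A.7, which this report does not reproduce. For orientation, the sketch below shows a common form of the idea: reward the agent with the negative log-likelihood of the current state under a density model fit online to visited states. The diagonal-Gaussian model and the `RunningGaussian`/`surprise_bonus` names are our assumptions, not the paper's listing.

```python
import numpy as np

class RunningGaussian:
    """Diagonal Gaussian over states, updated online via Welford's algorithm."""

    def __init__(self, dim, eps=1e-6):
        self.n, self.mean, self.m2, self.eps = 0, np.zeros(dim), np.zeros(dim), eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def neg_log_prob(self, x):
        var = self.m2 / max(self.n - 1, 1) + self.eps
        return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - self.mean) ** 2 / var)

def surprise_bonus(density, state):
    """Intrinsic reward: how surprising the state is under the model,
    computed *before* the model absorbs the new state."""
    r_int = density.neg_log_prob(state)
    density.update(state)
    return r_int
```

Note that this bonus is also non-stationary (the density model keeps changing), which is why it is a natural candidate for the paper's stationarity treatment.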
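The Software Dependencies and Experiment Setup rows report that the maze, Godot, and DeepSea experiments use Stable-Baselines3, with hyperparameters listed in the paper's Tables 2-4. Since those tables are not reproduced above, the snippet below only illustrates the standard Stable-Baselines3 PPO configuration pattern; every value shown is a placeholder, and `CartPole-v1` stands in for the paper's environments.

```python
# Illustrative only: Stable-Baselines3 PPO configuration pattern.
# Hyperparameter values are placeholders; the paper's actual settings
# are in its Tables 2-4.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in for the paper's maze environments

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,  # placeholder
    n_steps=2048,        # rollout length per update
    batch_size=64,
    gamma=0.99,          # discount factor
    ent_coef=0.0,        # entropy regularization coefficient
    verbose=1,
)
model.learn(total_timesteps=100_000)
```

A faithful reproduction would additionally require the intrinsic-reward wrapper and the exact values from the paper's hyperparameter tables.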