Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Improving Intrinsic Exploration by Creating Stationary Objectives
Authors: Roger Creus Castanyer, Joshua Romoff, Glen Berseth
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS: SOFE is designed to improve the performance of exploration tasks. To evaluate its efficacy, we study three questions: (1) How much does SOFE facilitate the optimization of non-stationary exploration bonuses? (2) Does this increased stationarity improve exploration for downstream tasks? (3) How well does SOFE scale to image-based state inputs where approximations are needed to estimate state-visitation frequencies? To answer each of these research questions, we run the experiments as follows. |
| Researcher Affiliation | Collaboration | Roger Creus Castanyer (Mila Québec AI Institute, Université de Montréal); Joshua Romoff (Ubisoft La Forge); Glen Berseth (Mila Québec AI Institute, Université de Montréal) |
| Pseudocode | Yes | A.7 STATE-ENTROPY MAXIMIZATION: In this section, we provide the pseudo-code for the surprise-maximization algorithm presented in Section 3.1.3. ... Algorithm 1: Surprise Maximization |
| Open Source Code | No | Videos of the trained agents and summarized findings can be found on our supplementary webpage. |
| Open Datasets | Yes | DeepSea sparse-reward hard-exploration task from the DeepMind suite (Osband et al., 2019); MiniHack-MultiRoom-N6-v0 task, originally used for E3B in Henaff et al. (2023); Procgen-Maze task (Cobbe et al., 2020); Habitat environment (Szot et al., 2021); HM3D dataset (Ramakrishnan et al., 2021) |
| Dataset Splits | No | No explicit statement of training, validation, and test dataset splits with percentages or counts was found. The paper focuses on experimental setups within reinforcement learning environments. |
| Hardware Specification | No | We optimize the E3B exploration bonus with PPO (Schulman et al., 2017) which requires 31 hours in a machine with a single GPU. |
| Software Dependencies | No | We use Stable-Baselines3 (Raffin et al., 2021) to run our experiments in the mazes, Godot maps, and DeepSea. |
| Experiment Setup | Yes | A.3 TRAINING DETAILS; Table 2: Hyperparameters for the DQN Implementation; Table 3: Hyperparameters for the PPO Implementation; Table 4: Hyperparameters for the A2C Implementation |