The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Authors: Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Researcher Affiliation | Academia | Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan; University of California, Berkeley; {cassidy_laidlaw,banghua,russell,anca}@cs.berkeley.edu
Pseudocode | Yes | Algorithm 1: the greedy over random policy (GORP) algorithm, used to define the effective horizon in deterministic environments. Algorithm 2: the shallow Q-iteration via reinforcement learning (SQIRL) algorithm. (A rough sketch of the GORP idea appears after this table.)
Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Open Datasets | Yes | We evaluate the algorithms in sticky-action versions of the BRIDGE environments from Laidlaw et al. (2023). (A sticky-action wrapper sketch appears after this table.)
Dataset Splits | No | The paper states: 'During training, we evaluate the latest policy every 10,000 training timesteps for 100 episodes.' This describes an evaluation protocol used during training, not a train/validation/test split specified as fixed percentages or sample counts (see the evaluation-callback sketch after this table).
Hardware Specification | No | The paper mentions using 'deep neural networks' and 'convolutional neural nets' for the implementation, but does not specify any particular hardware such as GPU models, CPU types, or memory sizes used for running the experiments.
Software Dependencies | No | 'We use the Stable-Baselines3 implementations of PPO and DQN (Raffin et al., 2021).' The paper names this software package but does not give version numbers for it or for the other libraries it depends on.
Experiment Setup | Yes | The paper lists the hyperparameters it uses for PPO (Table 4), DQN (Table 5), and SQIRL (Table 6).
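
For the "Pseudocode" row: Algorithm 1 in the paper defines the effective horizon via the greedy over random policy (GORP) procedure. The sketch below illustrates only the basic idea in its simplest form (commit to one greedy action per timestep, estimating each action's value by rolling out a uniformly random policy afterwards); the function name, the rollout budget, and the Gymnasium-style environment interface are assumptions, and this is not the paper's exact algorithm.

```python
import math


def gorp_sketch(env_factory, horizon, rollouts_per_action=10):
    """Rough sketch of the greedy-over-random-policy (GORP) idea: at each
    timestep, estimate the value of each action under a uniformly random
    continuation policy by Monte Carlo, then commit to the greedy action.
    Assumes a deterministic environment with a discrete action space that
    can be re-created from scratch via env_factory()."""
    chosen = []  # greedy action prefix committed to so far
    n_actions = env_factory().action_space.n
    for t in range(horizon):
        best_action, best_value = 0, -math.inf
        for a in range(n_actions):
            total = 0.0
            for _ in range(rollouts_per_action):
                env = env_factory()
                env.reset()
                ret, done = 0.0, False
                # Replay the committed prefix, try the candidate action,
                # then follow the uniformly random policy to the horizon.
                plan = chosen + [a]
                for step in range(horizon):
                    if done:
                        break
                    act = plan[step] if step < len(plan) else env.action_space.sample()
                    _, reward, terminated, truncated, _ = env.step(act)
                    ret += reward
                    done = terminated or truncated
                total += ret
            mean_return = total / rollouts_per_action
            if mean_return > best_value:
                best_action, best_value = a, mean_return
        chosen.append(best_action)
    return chosen
```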
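
For the "Open Datasets" row: sticky actions make otherwise deterministic environments stochastic by occasionally repeating the previously executed action instead of the one the agent selected. A minimal illustration as a Gymnasium wrapper is below; the 25% repeat probability follows the common Atari convention and is an assumption, since the value used for the BRIDGE environments is not stated in this summary.

```python
import random

import gymnasium as gym


class StickyActionWrapper(gym.Wrapper):
    """With probability repeat_prob, ignore the agent's chosen action and
    repeat the previously executed action instead (sticky actions)."""

    def __init__(self, env, repeat_prob=0.25):  # 0.25 is an assumed default
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.last_action is not None and random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```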
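
For the "Dataset Splits", "Software Dependencies", and "Experiment Setup" rows: the quoted evaluation protocol (evaluate every 10,000 training timesteps for 100 episodes) maps naturally onto Stable-Baselines3's EvalCallback. The sketch below shows one way such a PPO run could be wired up; the environment name and all hyperparameter values are placeholders rather than the settings in the paper's Tables 4-6.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

# Placeholder environment; the paper trains on sticky-action BRIDGE environments.
train_env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Evaluate the latest policy every 10,000 training timesteps for 100 episodes,
# matching the protocol quoted in the "Dataset Splits" row.
eval_callback = EvalCallback(eval_env, eval_freq=10_000, n_eval_episodes=100)

# Hyperparameter values here are illustrative placeholders; the paper's actual
# settings for PPO are listed in its Table 4.
model = PPO("MlpPolicy", train_env, learning_rate=3e-4, n_steps=2048, batch_size=64)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```

A DQN run would be set up analogously by swapping in stable_baselines3.DQN with the Table 5 hyperparameters; pinning the Stable-Baselines3 version in a requirements file would address the missing dependency versions noted above.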