The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Authors: Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Researcher Affiliation | Academia | Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan; University of California, Berkeley; {cassidy_laidlaw,banghua,russell,anca}@cs.berkeley.edu
Pseudocode | Yes | Algorithm 1: the greedy over random policy (GORP) algorithm, used to define the effective horizon in deterministic environments. Algorithm 2: the shallow Q-iteration via reinforcement learning (SQIRL) algorithm. (A rough sketch of the GORP idea appears after this table.)
Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Open Datasets | Yes | We evaluate the algorithms in sticky-action versions of the BRIDGE environments from Laidlaw et al. (2023). (A sticky-action wrapper sketch appears after this table.)
Dataset Splits | No | The paper states: 'During training, we evaluate the latest policy every 10,000 training timesteps for 100 episodes.' This describes an evaluation protocol used during training, not a train/validation/test split specified as fixed percentages or sample counts (see the evaluation-callback sketch after this table).
Hardware Specification | No | The paper mentions using 'deep neural networks' and 'convolutional neural nets' for the implementation, but does not specify any particular hardware such as GPU models, CPU types, or memory sizes used for running the experiments.
Software Dependencies | No | 'We use the Stable-Baselines3 implementations of PPO and DQN (Raffin et al., 2021).' The paper names this software package but does not give version numbers for it or for the other libraries it depends on.
Experiment Setup | Yes | The paper lists the hyperparameters it uses for PPO (Table 4), DQN (Table 5), and SQIRL (Table 6).
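
For the "Pseudocode" row: Algorithm 1 in the paper defines the effective horizon via the greedy over random policy (GORP) procedure. The sketch below illustrates only the basic idea in its simplest form (commit to one greedy action per timestep, estimating each action's value by rolling out a uniformly random policy afterwards); the function name, the rollout budget, and the Gymnasium-style environment interface are assumptions, and this is not the paper's exact algorithm.

```python
import math


def gorp_sketch(env_factory, horizon, rollouts_per_action=10):
    """Rough sketch of the greedy-over-random-policy (GORP) idea: at each
    timestep, estimate the value of each action under a uniformly random
    continuation policy by Monte Carlo, then commit to the greedy action.
    Assumes a deterministic environment with a discrete action space that
    can be re-created from scratch via env_factory()."""
    chosen = []  # greedy action prefix committed to so far
    n_actions = env_factory().action_space.n
    for t in range(horizon):
        best_action, best_value = 0, -math.inf
        for a in range(n_actions):
            total = 0.0
            for _ in range(rollouts_per_action):
                env = env_factory()
                env.reset()
                ret, done = 0.0, False
                # Replay the committed prefix, try the candidate action,
                # then follow the uniformly random policy to the horizon.
                plan = chosen + [a]
                for step in range(horizon):
                    if done:
                        break
                    act = plan[step] if step < len(plan) else env.action_space.sample()
                    _, reward, terminated, truncated, _ = env.step(act)
                    ret += reward
                    done = terminated or truncated
                total += ret
            mean_return = total / rollouts_per_action
            if mean_return > best_value:
                best_action, best_value = a, mean_return
        chosen.append(best_action)
    return chosen
```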
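
For the "Open Datasets" row: sticky actions make otherwise deterministic environments stochastic by occasionally repeating the previously executed action instead of the one the agent selected. A minimal illustration as a Gymnasium wrapper is below; the 25% repeat probability follows the common Atari convention and is an assumption, since the value used for the BRIDGE environments is not stated in this summary.

```python
import random

import gymnasium as gym


class StickyActionWrapper(gym.Wrapper):
    """With probability repeat_prob, ignore the agent's chosen action and
    repeat the previously executed action instead (sticky actions)."""

    def __init__(self, env, repeat_prob=0.25):  # 0.25 is an assumed default
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.last_action is not None and random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```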
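
For the "Dataset Splits", "Software Dependencies", and "Experiment Setup" rows: the quoted evaluation protocol (evaluate every 10,000 training timesteps for 100 episodes) maps naturally onto Stable-Baselines3's EvalCallback. The sketch below shows one way such a PPO run could be wired up; the environment name and all hyperparameter values are placeholders rather than the settings in the paper's Tables 4-6.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

# Placeholder environment; the paper trains on sticky-action BRIDGE environments.
train_env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

# Evaluate the latest policy every 10,000 training timesteps for 100 episodes,
# matching the protocol quoted in the "Dataset Splits" row.
eval_callback = EvalCallback(eval_env, eval_freq=10_000, n_eval_episodes=100)

# Hyperparameter values here are illustrative placeholders; the paper's actual
# settings for PPO are listed in its Table 4.
model = PPO("MlpPolicy", train_env, learning_rate=3e-4, n_steps=2048, batch_size=64)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```

A DQN run would be set up analogously by swapping in stable_baselines3.DQN with the Table 5 hyperparameters; pinning the Stable-Baselines3 version in a requirements file would address the missing dependency versions noted above.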