The Effective Horizon Explains Deep RL Performance in Stochastic Environments
Authors: Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. |
| Researcher Affiliation | Academia | Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan, University of California, Berkeley, {cassidylaidlaw,banghua,russell,anca}@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1: The greedy over random policy (GORP) algorithm, used to define the effective horizon in deterministic environments. Algorithm 2: The shallow Q-iteration via reinforcement learning (SQIRL) algorithm. A minimal illustrative sketch of GORP appears below the table. |
| Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon. |
| Open Datasets | Yes | We evaluate the algorithms in sticky-action versions of the BRIDGE environments from Laidlaw et al. (2023). |
| Dataset Splits | No | The paper states: 'During training, we evaluate the latest policy every 10,000 training timesteps for 100 episodes.' This describes an evaluation schedule during training, not a train/validation/test split specified as fixed percentages or sample counts. |
| Hardware Specification | No | The paper mentions using 'deep neural networks' and 'convolutional neural nets' for the implementation, but does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory sizes. |
| Software Dependencies | No | We use the Stable-Baselines3 implementations of PPO and DQN (Raffin et al., 2021). The paper names this software package but does not pin version numbers for it or for any other library; a hedged usage sketch follows the table. |
| Experiment Setup | Yes | The paper lists the full hyperparameters used for each algorithm: PPO (Table 4), DQN (Table 5), and SQIRL (Table 6). |
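
The Pseudocode row above cites Algorithm 1 (GORP) and Algorithm 2 (SQIRL). As a rough illustration only, here is a minimal Python sketch of the greedy-over-random-policy idea for the k = 1 case: at each timestep, every candidate action is scored by the mean return of m random rollouts, and the best action is fixed greedily. This is not the authors' implementation; `make_env`, `horizon`, `n_actions`, and `m` are hypothetical stand-ins, and a deterministic, episodic Gymnasium-style environment is assumed so that action prefixes replay exactly.

```python
import numpy as np


def gorp(make_env, horizon, n_actions, m=100, seed=0):
    """Hypothetical sketch of GORP with k = 1 (not the authors' code).

    Assumes a deterministic, episodic Gymnasium-style environment, so a
    committed action prefix can be replayed exactly from reset.
    """
    rng = np.random.default_rng(seed)
    env = make_env()
    prefix = []  # actions committed greedily so far

    def rollout(actions):
        # Replay the prefix plus one candidate action, then act
        # uniformly at random until the episode ends.
        env.reset(seed=0)
        total, done, t = 0.0, False, 0
        for a in actions:
            if done:
                break
            _, r, terminated, truncated, _ = env.step(a)
            total, done, t = total + r, terminated or truncated, t + 1
        while not done and t < horizon:
            a = int(rng.integers(n_actions))
            _, r, terminated, truncated, _ = env.step(a)
            total, done, t = total + r, terminated or truncated, t + 1
        return total

    for _ in range(horizon):
        # Score each candidate action by the mean return of m random rollouts.
        means = [np.mean([rollout(prefix + [a]) for _ in range(m)])
                 for a in range(n_actions)]
        prefix.append(int(np.argmax(means)))
    return prefix
```

The paper's full GORP uses k-step action sequences rather than single actions, and SQIRL generalizes the same idea to stochastic environments by fitting Q-functions with regression; this sketch covers only the simplest deterministic case.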
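The Software Dependencies and Dataset Splits rows together pin down most of what a reproduction needs: Stable-Baselines3's PPO and DQN, plus an evaluation of the latest policy every 10,000 training timesteps for 100 episodes. Below is a minimal sketch using Stable-Baselines3's documented `EvalCallback` API. The environment ID and timestep budget are placeholders (the paper uses sticky-action versions of the BRIDGE environments), and the actual hyperparameters would come from Tables 4–6 rather than the defaults used here.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

# Placeholder environment ID; the paper evaluates on sticky-action
# BRIDGE environments, not this standard Gym task.
ENV_ID = "CartPole-v1"

train_env = gym.make(ENV_ID)
eval_env = gym.make(ENV_ID)

# Matches the schedule quoted in the paper: evaluate the latest
# policy every 10,000 training timesteps for 100 episodes.
eval_callback = EvalCallback(
    eval_env,
    eval_freq=10_000,
    n_eval_episodes=100,
)

# Hyperparameters would come from Table 4 of the paper; Stable-Baselines3
# defaults are used here purely for illustration.
model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=500_000, callback=eval_callback)
```

Because the paper does not pin library versions, a faithful reproduction should record them explicitly (e.g. with `pip freeze`) alongside the results.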