Bridging RL Theory and Practice with the Effective Horizon

Authors: Cassidy Laidlaw, Stuart J. Russell, Anca Dragan

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. ... Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. (An illustrative sketch of exact computation on a tabular MDP follows the table.)
Researcher Affiliation | Academia | Cassidy Laidlaw, Stuart Russell, Anca Dragan, University of California, Berkeley; {cassidy_laidlaw,russell,anca}@cs.berkeley.edu
Pseudocode | Yes | Algorithm 1: The Greedy Over Random Policy (GORP) algorithm, used to motivate the effective horizon. (A minimal Python sketch of GORP appears after the table.)
Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Open Datasets | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Dataset Splits | No | The paper describes running 'evaluation episodes' during training and using '5 random seeds' for empirical sample complexity calculation, but does not specify explicit train/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | For deep RL experiments, we used a mix of A100, A4000, and A6000 GPUs from Nvidia. We ran the algorithms either on separate GPUs or sometimes ran multiple random seeds simultaneously on the same hardware. We used 1-8 CPU threads to run the RL environments.
Software Dependencies | No | We use the implementations of PPO and DQN from Stable-Baselines3 (SB3) [58]. The paper mentions the library but does not provide specific version numbers for SB3 or other software dependencies. (A hedged SB3 configuration sketch using the hyperparameters below follows the table.)
Experiment Setup | Yes | We use the following hyperparameters for PPO: Training timesteps 5,000,000; Number of environments 8; Number of steps per rollout {128, 1280}; Clipping parameter (ϵ) 0.1; Value function coefficient 0.5; Entropy coefficient 0.01; Optimizer Adam; Learning rate 2.5 × 10⁻⁴; Number of epochs per training batch 4; Minibatch size 256; GAE coefficient (λ) 0.95; Advantage normalization yes; Gradient clipping 0.5.
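
Because every MDP in BRIDGE is deterministic and ships with a tabular representation, instance-dependent quantities such as the optimal finite-horizon return can be computed exactly by backward induction rather than estimated from samples. The sketch below is illustrative only: the array layout and the function name are hypothetical choices and do not reflect the BRIDGE file format released in the repository.

    import numpy as np

    def optimal_return(transitions, rewards, horizon, start_state=0):
        """Exact backward induction on a deterministic tabular MDP.

        transitions: int array of shape (num_states, num_actions), next-state ids
        rewards:     float array of shape (num_states, num_actions)
        Returns the optimal return from start_state over `horizon` steps.
        (This array encoding is hypothetical, not the BRIDGE format.)
        """
        num_states, _ = transitions.shape
        v = np.zeros(num_states)              # value after the final step
        for _ in range(horizon):
            q = rewards + v[transitions]      # Q(s, a) = r(s, a) + V(s')
            v = q.max(axis=1)                 # V(s) = max_a Q(s, a)
        return v[start_state]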
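
Algorithm 1 (GORP) is simple enough to render in a few lines. The version below is a hedged, minimal sketch of the k = 1 case only: it assumes a deterministic environment with an old-style Gym interface (4-tuple step return) and a fixed horizon, and the make_env factory, rollout count m, and prefix-replay loop are illustrative choices rather than the authors' implementation, which supports general k.

    import numpy as np

    def gorp(make_env, horizon, num_actions, m=100, seed=0):
        """Minimal GORP sketch (k = 1): commit to one action per timestep,
        chosen greedily against Q-values estimated from m uniformly random
        rollouts. Determinism lets us replay the committed prefix after each
        reset to return to the current state."""
        rng = np.random.default_rng(seed)
        prefix = []                                   # actions committed so far
        for _ in range(horizon):
            q_hat = np.zeros(num_actions)
            for a in range(num_actions):
                total = 0.0
                for _ in range(m):
                    env = make_env()
                    env.reset()
                    ret, done = 0.0, False
                    # replay the prefix, then try candidate action a
                    for action in prefix + [a]:
                        if done:
                            break
                        _, r, done, _ = env.step(action)
                        ret += r
                    # finish the episode with a uniformly random policy
                    while not done:
                        _, r, done, _ = env.step(int(rng.integers(num_actions)))
                        ret += r
                    total += ret
                q_hat[a] = total / m
            prefix.append(int(np.argmax(q_hat)))
        return prefix                                 # open-loop action sequence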
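
For the experiment setup, the quoted PPO hyperparameters map onto the Stable-Baselines3 API roughly as follows. This is a hedged sketch, not the paper's training script: the environment id and policy are placeholders, and n_steps is shown with one of the two values the paper lists.

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # Placeholder environment id; the paper trains on the BRIDGE MDPs.
    env = make_vec_env("CartPole-v1", n_envs=8)

    model = PPO(
        "MlpPolicy",
        env,
        n_steps=128,               # paper uses 128 or 1280 steps per rollout
        clip_range=0.1,
        vf_coef=0.5,
        ent_coef=0.01,
        learning_rate=2.5e-4,      # Adam is SB3's default optimizer
        n_epochs=4,
        batch_size=256,
        gae_lambda=0.95,
        normalize_advantage=True,
        max_grad_norm=0.5,
    )
    model.learn(total_timesteps=5_000_000)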