Bridging RL Theory and Practice with the Effective Horizon

Authors: Cassidy Laidlaw, Stuart J. Russell, Anca Dragan

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. ... Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. (An illustrative sketch of exact computation on a tabular MDP follows the table.)
Researcher Affiliation | Academia | Cassidy Laidlaw, Stuart Russell, Anca Dragan, University of California, Berkeley; {cassidy_laidlaw,russell,anca}@cs.berkeley.edu
Pseudocode | Yes | Algorithm 1: The Greedy Over Random Policy (GORP) algorithm, used to motivate the effective horizon. (A minimal Python sketch of GORP appears after the table.)
Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Open Datasets | Yes | Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
Dataset Splits | No | The paper describes running 'evaluation episodes' during training and using '5 random seeds' for empirical sample complexity calculation, but does not specify explicit train/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | For deep RL experiments, we used a mix of A100, A4000, and A6000 GPUs from Nvidia. We ran the algorithms either on separate GPUs or sometimes ran multiple random seeds simultaneously on the same hardware. We used 1-8 CPU threads to run the RL environments.
Software Dependencies | No | We use the implementations of PPO and DQN from Stable-Baselines3 (SB3) [58]. The paper mentions the library but does not provide specific version numbers for SB3 or other software dependencies. (A hedged SB3 configuration sketch using the hyperparameters below follows the table.)
Experiment Setup | Yes | We use the following hyperparameters for PPO: Training timesteps 5,000,000; Number of environments 8; Number of steps per rollout {128, 1280}; Clipping parameter (ϵ) 0.1; Value function coefficient 0.5; Entropy coefficient 0.01; Optimizer Adam; Learning rate 2.5 × 10⁻⁴; Number of epochs per training batch 4; Minibatch size 256; GAE coefficient (λ) 0.95; Advantage normalization yes; Gradient clipping 0.5.
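
Because every MDP in BRIDGE is deterministic and ships with a tabular representation, instance-dependent quantities such as the optimal finite-horizon return can be computed exactly by backward induction rather than estimated from samples. The sketch below is illustrative only: the array layout and the function name are hypothetical choices and do not reflect the BRIDGE file format released in the repository.

    import numpy as np

    def optimal_return(transitions, rewards, horizon, start_state=0):
        """Exact backward induction on a deterministic tabular MDP.

        transitions: int array of shape (num_states, num_actions), next-state ids
        rewards:     float array of shape (num_states, num_actions)
        Returns the optimal return from start_state over `horizon` steps.
        (This array encoding is hypothetical, not the BRIDGE format.)
        """
        num_states, _ = transitions.shape
        v = np.zeros(num_states)              # value after the final step
        for _ in range(horizon):
            q = rewards + v[transitions]      # Q(s, a) = r(s, a) + V(s')
            v = q.max(axis=1)                 # V(s) = max_a Q(s, a)
        return v[start_state]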
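
Algorithm 1 (GORP) is simple enough to render in a few lines. The version below is a hedged, minimal sketch of the k = 1 case only: it assumes a deterministic environment with an old-style Gym interface (4-tuple step return) and a fixed horizon, and the make_env factory, rollout count m, and prefix-replay loop are illustrative choices rather than the authors' implementation, which supports general k.

    import numpy as np

    def gorp(make_env, horizon, num_actions, m=100, seed=0):
        """Minimal GORP sketch (k = 1): commit to one action per timestep,
        chosen greedily against Q-values estimated from m uniformly random
        rollouts. Determinism lets us replay the committed prefix after each
        reset to return to the current state."""
        rng = np.random.default_rng(seed)
        prefix = []                                   # actions committed so far
        for _ in range(horizon):
            q_hat = np.zeros(num_actions)
            for a in range(num_actions):
                total = 0.0
                for _ in range(m):
                    env = make_env()
                    env.reset()
                    ret, done = 0.0, False
                    # replay the prefix, then try candidate action a
                    for action in prefix + [a]:
                        if done:
                            break
                        _, r, done, _ = env.step(action)
                        ret += r
                    # finish the episode with a uniformly random policy
                    while not done:
                        _, r, done, _ = env.step(int(rng.integers(num_actions)))
                        ret += r
                    total += ret
                q_hat[a] = total / m
            prefix.append(int(np.argmax(q_hat)))
        return prefix                                 # open-loop action sequence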
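
For the experiment setup, the quoted PPO hyperparameters map onto the Stable-Baselines3 API roughly as follows. This is a hedged sketch, not the paper's training script: the environment id and policy are placeholders, and n_steps is shown with one of the two values the paper lists.

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # Placeholder environment id; the paper trains on the BRIDGE MDPs.
    env = make_vec_env("CartPole-v1", n_envs=8)

    model = PPO(
        "MlpPolicy",
        env,
        n_steps=128,               # paper uses 128 or 1280 steps per rollout
        clip_range=0.1,
        vf_coef=0.5,
        ent_coef=0.01,
        learning_rate=2.5e-4,      # Adam is SB3's default optimizer
        n_epochs=4,
        batch_size=256,
        gae_lambda=0.95,
        normalize_advantage=True,
        max_grad_norm=0.5,
    )
    model.learn(total_timesteps=5_000_000)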