Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks

Authors: Ryan Sullivan, Akarsh Kumar, Shengyi Huang, John Dickerson, Joseph Suarez

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work applies DreamerV3's tricks to PPO and is the first such empirical study outside of the original work. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We use a high-quality PPO reference implementation and present extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite.
Researcher Affiliation | Academia | Ryan Sullivan (rsulli@umd.edu), University of Maryland, College Park, MD, USA; Akarsh Kumar (akarshkumar0101@gmail.com), Massachusetts Institute of Technology, Cambridge, MA, USA; Shengyi Huang (costa.huang@outlook.com), Drexel University, Philadelphia, PA, USA; John P. Dickerson (johnd@umd.edu), University of Maryland, College Park, MD, USA; Joseph Suarez (jsuarez@mit.edu), Massachusetts Institute of Technology, Cambridge, MA, USA
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Finally, we have open-sourced the implementation of our tricks and experiments: https://github.com/RyanNavillus/PPO-v3
Open Datasets | Yes | We test our method on 35 proprioceptive control environments from the DeepMind Control Suite. We train agents on each of the 57 environments in the Arcade Learning Environment for multiple seeds in each of our ablations.
Dataset Splits | No | The paper describes training agents on specific environments with multiple seeds per ablation, but it does not explicitly provide train/validation/test dataset splits. Such splits are uncommon in RL, where agents interact continuously with environments rather than learning from static datasets.
Hardware Specification | Yes | We report results using standard metrics as well as those provided by the RLiable library, as recommended in Agarwal et al. [16], when results require uncertainty measures. They describe a methodology and metrics for creating reproducible results with only a handful of runs by reporting uncertainty. These metrics include 95% stratified bootstrap confidence intervals for the mean, median, interquartile mean (IQM), and optimality gap (the amount by which an algorithm fails to meet a minimum normalized score of 1). We used approximately 8000 GPU hours for the experiments in this paper, as well as 4000 more for testing and development, most of which were run on Nvidia A100s. (See the IQM/bootstrap sketch after this table.)
Software Dependencies | No | We implement these tricks as minimal extensions to CleanRL's extensively validated and benchmarked PPO implementation. ... automatic hyperparameter tuning of the learning rate, entropy coefficient, and value loss coefficient using Optuna. The paper mentions 'CleanRL' and 'Optuna' but does not specify their version numbers. (See the Optuna sketch after this table.)
Experiment Setup | Yes | We performed manual hyperparameter tuning for each of the implementation tricks, as well as automatic hyperparameter tuning of the learning rate, entropy coefficient, and value loss coefficient using Optuna... We use the critic EMA to regularize the critic loss using the same decay rate (0.98) and regularizer coefficient (1.0) as DreamerV3. ...when using two-hot encoding we initialize the critic logits to zero... when two-hot encoding is disabled, for a critic f(x, θ) with inputs x and parameters θ we use MSE loss to predict the symlog-transformed returns y. ...when symlog is enabled, we follow DreamerV3 and set the range of the two-hot bins to [-20, 20]... When symlog is disabled, we instead choose a range of [-15000, 15000] for Atari environments without reward clipping enabled, and [-1000, 1000] for Atari environments with reward clipping enabled as well as the DeepMind Control Suite... we change the percentile EMA decay rate from 0.99 to 0.995 for the DeepMind Control Suite and 0.999 for the Arcade Learning Environment. ...We do not experiment with changing the unimix ratio in this paper and use 1% as in Hafner et al. [3]. (See the symlog/two-hot sketch after this table.)
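The metrics quoted in the Hardware Specification row (IQM, optimality gap, 95% stratified bootstrap confidence intervals) can be illustrated with plain NumPy. The sketch below is not the RLiable library itself; the array shapes, the replication count, and the example data are assumptions made only for illustration.

```python
# Illustrative NumPy sketch of RLiable-style aggregate metrics: IQM,
# optimality gap, and a 95% stratified bootstrap confidence interval.
import numpy as np

def iqm(scores):
    # Interquartile mean: mean of the middle 50% of all scores.
    s = np.sort(scores, axis=None)
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def optimality_gap(scores, threshold=1.0):
    # Average shortfall below a minimum normalized score of 1.
    return np.mean(np.maximum(threshold - scores, 0.0))

def stratified_bootstrap_ci(score_matrix, statistic, reps=2000, alpha=0.05, seed=0):
    # score_matrix: (num_runs, num_tasks) normalized scores.
    # Runs are resampled with replacement independently for each task.
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = score_matrix.shape
    stats = np.empty(reps)
    for r in range(reps):
        idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        stats[r] = statistic(np.take_along_axis(score_matrix, idx, axis=0))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example with made-up data: 5 seeds across 57 Atari games.
scores = np.random.default_rng(1).uniform(0.0, 2.0, size=(5, 57))
print(iqm(scores), optimality_gap(scores), stratified_bootstrap_ci(scores, iqm))
```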
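The Software Dependencies and Experiment Setup rows both mention automatic tuning of the learning rate, entropy coefficient, and value loss coefficient with Optuna. The sketch below shows the general Optuna pattern; `train_ppo` is a hypothetical placeholder for the CleanRL-based training entry point, and the search ranges and trial count are assumptions, not values from the paper.

```python
# Illustrative Optuna sweep over the three automatically tuned hyperparameters.
import random
import optuna

def train_ppo(learning_rate, ent_coef, vf_coef):
    # Placeholder for the actual CleanRL-based training run; returns a
    # random score here so the sketch executes end to end.
    return random.random()

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    ent_coef = trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True)
    vf_coef = trial.suggest_float("vf_coef", 0.1, 1.0)
    # In practice this would return a scalar score such as mean episodic return.
    return train_ppo(learning_rate=lr, ent_coef=ent_coef, vf_coef=vf_coef)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```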
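The symlog transform and two-hot critic targets described in the Experiment Setup row can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the 255-bin count follows DreamerV3's default, and the batch handling is an assumption.

```python
# Minimal PyTorch sketch of the symlog transform and two-hot critic targets.
import torch

def symlog(x):
    # Compress large-magnitude returns: sign(x) * ln(|x| + 1).
    return torch.sign(x) * torch.log(torch.abs(x) + 1.0)

def symexp(x):
    # Inverse of symlog, used to decode the critic's prediction.
    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1.0)

def twohot_encode(y, bins):
    # Spread each scalar target y over its two neighboring bins so that the
    # expected bin value equals y.
    # y:    (batch,) tensor of (symlog-transformed) returns
    # bins: (num_bins,) tensor of increasing bin centers
    y = torch.clamp(y, bins[0], bins[-1])
    k = torch.clamp(torch.searchsorted(bins, y, right=True) - 1, 0, len(bins) - 2)
    lo, hi = bins[k], bins[k + 1]
    w_hi = (y - lo) / (hi - lo)  # weight on the upper bin
    target = torch.zeros(y.shape[0], len(bins))
    target.scatter_(1, k.unsqueeze(1), (1.0 - w_hi).unsqueeze(1))
    target.scatter_(1, (k + 1).unsqueeze(1), w_hi.unsqueeze(1))
    return target

# With symlog enabled, the bins span [-20, 20]; the critic is trained with
# cross-entropy against these two-hot targets.
bins = torch.linspace(-20.0, 20.0, 255)
returns = torch.tensor([0.5, 12.0, -300.0])
targets = twohot_encode(symlog(returns), bins)

# Zero-initialized critic logits give a uniform distribution; the predicted
# value is the expected bin center mapped back through symexp.
probs = torch.softmax(torch.zeros(3, 255), dim=-1)
value = symexp((probs * bins).sum(dim=-1))
```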