A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning
Authors: Jacob Adkins, Michael Bowling, Adam White
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm's performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. |
| Researcher Affiliation | Academia | Jacob Adkins, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (jadkins@ualberta.ca). Michael Bowling, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (mbowling@ualberta.ca). Adam White, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (amw8@ualberta.ca). |
| Pseudocode | No | The paper describes the algorithms textually but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will release code and experiment data at https://github.com/jadkins99/hyperparameter_sensitivity, promoting the further investigation of hyperparameter sensitivity in the field of reinforcement learning. Code is included in the supplementary material; instructions are in the README file. |
| Open Datasets | Yes | We performed a large-scale hyperparameter study over variants of the PPO algorithm consisting of over 4.3 million runs (13 trillion environment steps) in the Brax MuJoCo domains (Freeman et al., 2021). The environments used in the experiments were the Brax implementations of Ant, Halfcheetah, Hopper, Swimmer, and Walker2d (Freeman et al., 2021). |
| Dataset Splits | No | The paper describes hyperparameter tuning and evaluation metrics (e.g., AUC, confidence intervals over runs) but does not define traditional train/validation/test dataset splits. This is typical for reinforcement learning research, where agents learn from environment interaction rather than from a fixed dataset. |
| Hardware Specification | Yes | Our study ran for approximately 4.5 GPU years on NVIDIA 32GB V100s. |
| Software Dependencies | No | The PPO implementation used was heavily inspired by the PureJaxRL PPO implementation (Lu et al., 2022). Separate Adam optimizers (Kingma & Ba, 2015) were used for training the actor and critic networks. However, specific version numbers for these or other software dependencies are not provided. |
| Experiment Setup | Yes | The policy and critic networks were parametrized by fully connected MLP networks, each with two hidden layers of 256 units. The networks used the tanh activation function. The hyperparameter sweeps were grid searches over eligibility trace λ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, entropy regularizer coefficient τ ∈ {0.001, 0.01, 0.1, 1.0, 10.0}, actor step-size αθ ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}, and critic step-size αw ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}. |
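
For context on the environments named in the Open Datasets row, the sketch below shows how the five Brax MuJoCo tasks could be instantiated. It assumes the `brax.envs.create` entry point and the lowercase environment-name strings; neither is taken from the authors' released code, and the exact registry names may differ across Brax versions.

```python
import jax
from brax import envs  # assumes Brax is installed; the API differs across Brax versions

# The five Brax MuJoCo tasks used in the paper's hyperparameter study.
# The name strings are assumptions; consult the Brax environment registry.
ENV_NAMES = ["ant", "halfcheetah", "hopper", "swimmer", "walker2d"]

def make_envs(seed: int = 0):
    """Create one instance of each task and reset it with a per-task PRNG key."""
    keys = jax.random.split(jax.random.PRNGKey(seed), len(ENV_NAMES))
    created = {}
    for name, key in zip(ENV_NAMES, keys):
        env = envs.create(env_name=name)      # assumed call pattern
        created[name] = (env, env.reset(key)) # initial State for this task
    return created
```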
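The Software Dependencies and Experiment Setup rows together describe two-hidden-layer tanh MLPs trained with separate Adam optimizers for the actor and the critic. Below is a minimal sketch assuming Flax and Optax; the framework, module names, and placeholder dimensions are assumptions, and the step-sizes shown are just one point from the reported sweep.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class MLP(nn.Module):
    """Fully connected network with two hidden layers of 256 units and tanh activations."""
    out_dim: int  # action dimension for the actor, 1 for the critic

    @nn.compact
    def __call__(self, x):
        x = nn.tanh(nn.Dense(256)(x))
        x = nn.tanh(nn.Dense(256)(x))
        return nn.Dense(self.out_dim)(x)

obs_dim, act_dim = 27, 8  # placeholder dimensions for illustration only
rng_actor, rng_critic = jax.random.split(jax.random.PRNGKey(0))
dummy_obs = jnp.zeros((1, obs_dim))

actor_params = MLP(out_dim=act_dim).init(rng_actor, dummy_obs)
critic_params = MLP(out_dim=1).init(rng_critic, dummy_obs)

# Separate Adam optimizers for the actor and critic, as stated in the paper.
actor_opt = optax.adam(learning_rate=1e-3)   # actor step-size αθ (one sweep value)
critic_opt = optax.adam(learning_rate=1e-3)  # critic step-size αw (one sweep value)
actor_opt_state = actor_opt.init(actor_params)
critic_opt_state = critic_opt.init(critic_params)
```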
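The grid search in the Experiment Setup row spans 5^4 = 625 hyperparameter settings per algorithm variant and environment. A minimal sketch of enumerating that grid follows; the dictionary key names are illustrative and not taken from the released code.

```python
from itertools import product

# Hyperparameter grid as reported in the experiment setup.
GRID = {
    "trace_lambda": [0.1, 0.3, 0.5, 0.7, 0.9],      # eligibility trace λ
    "entropy_coef": [0.001, 0.01, 0.1, 1.0, 10.0],  # entropy regularizer coefficient τ
    "actor_lr":     [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], # actor step-size αθ
    "critic_lr":    [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], # critic step-size αw
}

# Cartesian product over all four hyperparameters: 5 * 5 * 5 * 5 = 625 settings.
settings = [dict(zip(GRID, values)) for values in product(*GRID.values())]
assert len(settings) == 625
```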