A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning
Authors: Jacob Adkins, Michael Bowling, Adam White
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm's performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. |
| Researcher Affiliation | Academia | Jacob Adkins, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (jadkins@ualberta.ca). Michael Bowling, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (mbowling@ualberta.ca). Adam White, Department of Computing Science, University of Alberta; Amii, Edmonton, Canada (amw8@ualberta.ca). |
| Pseudocode | No | The paper describes the algorithms textually but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will release code and experiment data at https://github.com/jadkins99/hyperparameter_sensitivity, promoting the further investigation of hyperparameter sensitivity in the field of reinforcement learning. Code is included in the supplementary material; instructions are in the README file. |
| Open Datasets | Yes | We performed a large-scale hyperparameter study over variants of the PPO algorithm consisting of over 4.3 million runs (13 trillion environment steps) in the Brax MuJoCo domains (Freeman et al., 2021). The environments used in the experiments were the Brax implementations of Ant, Halfcheetah, Hopper, Swimmer, and Walker2d (Freeman et al., 2021). |
| Dataset Splits | No | The paper describes hyperparameter tuning and evaluation metrics (e.g., AUC, confidence intervals over runs) but does not define traditional train/validation/test dataset splits. This is typical for reinforcement learning research, where agents learn from environment interaction rather than from a fixed dataset. |
| Hardware Specification | Yes | Our study ran for approximately 4.5 GPU years on NVIDIA 32GB V100s. |
| Software Dependencies | No | The PPO implementation used was heavily inspired by the PureJaxRL PPO implementation (Lu et al., 2022). Separate Adam optimizers (Kingma & Ba, 2015) were used for training the actor and critic networks. However, specific version numbers for these or other software dependencies are not provided. |
| Experiment Setup | Yes | The policy and critic networks were parametrized by fully connected MLP networks, each with two hidden layers of 256 units. The networks used the tanh activation function. The hyperparameter sweeps were grid searches over eligibility trace λ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, entropy regularizer coefficient τ ∈ {0.001, 0.01, 0.1, 1.0, 10.0}, actor step-size αθ ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}, and critic step-size αw ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}. |
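
For context on the environments named in the Open Datasets row, the sketch below shows how the five Brax MuJoCo tasks could be instantiated. It assumes the `brax.envs.create` entry point and the lowercase environment-name strings; neither is taken from the authors' released code, and the exact registry names may differ across Brax versions.

```python
import jax
from brax import envs  # assumes Brax is installed; the API differs across Brax versions

# The five Brax MuJoCo tasks used in the paper's hyperparameter study.
# The name strings are assumptions; consult the Brax environment registry.
ENV_NAMES = ["ant", "halfcheetah", "hopper", "swimmer", "walker2d"]

def make_envs(seed: int = 0):
    """Create one instance of each task and reset it with a per-task PRNG key."""
    keys = jax.random.split(jax.random.PRNGKey(seed), len(ENV_NAMES))
    created = {}
    for name, key in zip(ENV_NAMES, keys):
        env = envs.create(env_name=name)      # assumed call pattern
        created[name] = (env, env.reset(key)) # initial State for this task
    return created
```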
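The Software Dependencies and Experiment Setup rows together describe two-hidden-layer tanh MLPs trained with separate Adam optimizers for the actor and the critic. Below is a minimal sketch assuming Flax and Optax; the framework, module names, and placeholder dimensions are assumptions, and the step-sizes shown are just one point from the reported sweep.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class MLP(nn.Module):
    """Fully connected network with two hidden layers of 256 units and tanh activations."""
    out_dim: int  # action dimension for the actor, 1 for the critic

    @nn.compact
    def __call__(self, x):
        x = nn.tanh(nn.Dense(256)(x))
        x = nn.tanh(nn.Dense(256)(x))
        return nn.Dense(self.out_dim)(x)

obs_dim, act_dim = 27, 8  # placeholder dimensions for illustration only
rng_actor, rng_critic = jax.random.split(jax.random.PRNGKey(0))
dummy_obs = jnp.zeros((1, obs_dim))

actor_params = MLP(out_dim=act_dim).init(rng_actor, dummy_obs)
critic_params = MLP(out_dim=1).init(rng_critic, dummy_obs)

# Separate Adam optimizers for the actor and critic, as stated in the paper.
actor_opt = optax.adam(learning_rate=1e-3)   # actor step-size αθ (one sweep value)
critic_opt = optax.adam(learning_rate=1e-3)  # critic step-size αw (one sweep value)
actor_opt_state = actor_opt.init(actor_params)
critic_opt_state = critic_opt.init(critic_params)
```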
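The grid search in the Experiment Setup row spans 5^4 = 625 hyperparameter settings per algorithm variant and environment. A minimal sketch of enumerating that grid follows; the dictionary key names are illustrative and not taken from the released code.

```python
from itertools import product

# Hyperparameter grid as reported in the experiment setup.
GRID = {
    "trace_lambda": [0.1, 0.3, 0.5, 0.7, 0.9],      # eligibility trace λ
    "entropy_coef": [0.001, 0.01, 0.1, 1.0, 10.0],  # entropy regularizer coefficient τ
    "actor_lr":     [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], # actor step-size αθ
    "critic_lr":    [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], # critic step-size αw
}

# Cartesian product over all four hyperparameters: 5 * 5 * 5 * 5 = 625 settings.
settings = [dict(zip(GRID, values)) for values in product(*GRID.values())]
assert len(settings) == 625
```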