Gray-Box Gaussian Processes for Automated Reinforcement Learning

Authors: Gresa Shala, André Biedenkapp, Frank Hutter, Josif Grabocka

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: Mujoco, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.
Researcher Affiliation | Collaboration | Gresa Shala¹, André Biedenkapp¹, Frank Hutter¹,², Josif Grabocka¹; ¹Department of Computer Science, University of Freiburg; ²Bosch Center for Artificial Intelligence
Pseudocode | Yes | Algorithm 1: Gray-Box HPO for RL (a hedged sketch of a generic gray-box HPO loop appears after this table).
Open Source Code | Yes | To ensure reproducibility (another issue in modern RL) and broad use of RCGP, all our code is open-sourced at https://github.com/releaunifreiburg/RCGP.
Open Datasets | Yes | We evaluated static hyperparameter optimization (HPO) methods by querying AutoRL-Bench, a tabular benchmark for AutoRL that contains reward curves for three different random seeds belonging to runs of RL algorithms with every possible combination of hyperparameter values from the search spaces shown in Table 1. (A hypothetical sketch of such a tabular lookup appears after this table.)
Dataset Splits | No | The paper evaluates HPO methods on pre-recorded reward curves from a benchmark (AutoRL-Bench) rather than on a dataset with explicit train/validation/test splits. While it mentions training steps and evaluation seeds, it does not specify percentages or counts for distinct dataset partitions.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software such as 'GPyTorch' and the 'Ray Tune' library but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We focus on evaluating the performance of our proposed method, RCGP (Reward-Curve GP), for optimizing the hyperparameters of five popular model-free RL algorithms: PPO (Schulman et al., 2017), A2C (Mnih et al., 2016), DDPG (Lillicrap et al., 2016), SAC (Haarnoja et al., 2018), and TD3 (Fujimoto et al., 2018). In total, we consider 22 distinct Gym (Brockman et al., 2016) environments, grouped into the Atari (Bellemare et al., 2013), Classic Control, and Mujoco (Todorov et al., 2012) categories. We denote the full list of environments and their respective action space types in Appendix A, and we list the search spaces for the hyperparameters of each RL algorithm in Table 1. (A toy enumeration of a discrete search-space grid is sketched after this table.)
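
Because AutoRL-Bench is tabular, "evaluating" a hyperparameter configuration means looking up a pre-recorded reward curve rather than training an agent. The snippet below is a minimal, hypothetical sketch of such a lookup; the nested-dictionary layout, the `query` helper, the configuration keys, and the reward values are all illustrative assumptions, not the actual AutoRL-Bench API or data.

```python
# Hypothetical tabular-benchmark lookup: evaluation never trains an RL agent,
# it only retrieves a pre-recorded reward curve for a (config, seed) pair.
from statistics import mean

# table[algo][env][config_key][seed] -> list of rewards, one per curve step
table = {
    "PPO": {
        "CartPole-v1": {
            ("lr=1e-3", "gamma=0.99"): {0: [20.0, 55.0, 130.0],
                                        1: [18.0, 60.0, 140.0],
                                        2: [22.0, 50.0, 125.0]},
        }
    }
}

def query(algo, env, config_key, step, seeds=(0, 1, 2)):
    """Mean reward over seeds at a given curve step, read from the table."""
    curves = table[algo][env][config_key]
    return mean(curves[s][step] for s in seeds)

print(query("PPO", "CartPole-v1", ("lr=1e-3", "gamma=0.99"), step=2))  # 131.66...
```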
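The search spaces in the paper's Table 1 are finite grids, so every combination of hyperparameter values has a pre-recorded curve in the benchmark. The sketch below only illustrates what enumerating such a grid looks like; the hyperparameter names and values are placeholders, not the paper's actual search space.

```python
# Hypothetical discrete search-space grid; every combination would have a
# pre-recorded curve in the tabular benchmark. Values are placeholders.
from itertools import product

ppo_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "gamma": [0.95, 0.99],
    "clip_range": [0.1, 0.2, 0.3],
}

grid = [dict(zip(ppo_space, values)) for values in product(*ppo_space.values())]
print(len(grid))  # 18 configurations in this toy grid
```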
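The paper's Algorithm 1 itself is not reproduced in this summary. As a rough illustration of the gray-box idea (using partial reward curves, rather than only final returns, to guide hyperparameter selection), here is a minimal sketch of a generic multi-fidelity HPO loop with a GP surrogate. It uses scikit-learn's `GaussianProcessRegressor` in place of the GPyTorch model mentioned in the paper, and `run_rl_for_steps` is a hypothetical user-supplied callable; none of this is the authors' implementation.

```python
# Minimal gray-box HPO sketch: a GP surrogate scores configurations from
# their partial reward curves, and the most promising configuration is
# trained for another budget slice. Illustrative only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gray_box_hpo(configs, run_rl_for_steps, budget_step, total_steps):
    """configs: list of dicts with numeric hyperparameters.
    run_rl_for_steps(cfg, n): hypothetical callable that resumes the RL run
    for cfg by n steps and returns the reward observed afterwards."""
    steps = {i: 0 for i in range(len(configs))}    # budget spent per config
    curves = {i: [] for i in range(len(configs))}  # partial reward curves
    X = np.array([[cfg[k] for k in sorted(cfg)] for cfg in configs])

    while max(steps.values()) < total_steps:
        # Features: config encoding + fraction of budget spent so far.
        feats, targets = [], []
        for i, curve in curves.items():
            for t, r in enumerate(curve, start=1):
                feats.append(np.append(X[i], t * budget_step / total_steps))
                targets.append(r)
        if targets:
            gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
            gp.fit(np.array(feats), np.array(targets))
            # Score each config at its *next* fidelity; pick optimistically (UCB).
            cand = np.array([np.append(X[i], (steps[i] + budget_step) / total_steps)
                             for i in range(len(configs))])
            mu, sigma = gp.predict(cand, return_std=True)
            nxt = int(np.argmax(mu + 1.0 * sigma))
        else:
            nxt = 0  # no observations yet: start with the first config
        reward = run_rl_for_steps(configs[nxt], budget_step)
        curves[nxt].append(reward)
        steps[nxt] += budget_step

    best = max(curves, key=lambda i: curves[i][-1] if curves[i] else -np.inf)
    return configs[best], curves[best]
```

Against a tabular benchmark like the one sketched above, the `run_rl_for_steps` callable would simply read the next entry of the pre-recorded curve for the chosen configuration instead of training an agent.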