Gray-Box Gaussian Processes for Automated Reinforcement Learning
Authors: Gresa Shala, André Biedenkapp, Frank Hutter, Josif Grabocka
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: Mujoco, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL. |
| Researcher Affiliation | Collaboration | Gresa Shala¹, André Biedenkapp¹, Frank Hutter¹,², Josif Grabocka¹ (¹Department of Computer Science, University of Freiburg; ²Bosch Center for Artificial Intelligence) |
| Pseudocode | Yes | Algorithm 1: Gray-Box HPO for RL (a simplified sketch of this gray-box loop appears below the table) |
| Open Source Code | Yes | To ensure reproducibility (another issue in modern RL) and broad use of RCGP, all our code is open-sourced at https://github.com/releaunifreiburg/RCGP. |
| Open Datasets | Yes | We evaluated static hyperparameter optimization (HPO) methods by querying AutoRL-Bench, which is a tabular benchmark for AutoRL that contains reward curves for three different random seeds belonging to runs of RL algorithms with every possible combination of hyperparameter values from the search spaces shown in Table 1. (A sketch of such a tabular lookup appears below the table.) |
| Dataset Splits | No | The paper evaluates HPO methods on pre-recorded reward curves from a benchmark (AutoRL-Bench) rather than on a dataset with explicit train/validation/test splits. While it mentions training steps and evaluation seeds, it does not specify percentages or counts for distinct dataset partitions. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software like 'GPytorch' and 'Ray Tune library' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We focus on evaluating the performance of our proposed method, RCGP (Reward-Curve GP), for optimizing the hyperparameters of five popular model-free RL algorithms: PPO (Schulman et al., 2017), A2C (Mnih et al., 2016), DDPG (Lillicrap et al., 2016), SAC (Haarnoja et al., 2018), and TD3 (Fujimoto et al., 2018). In total, we consider 22 distinct Gym (Brockman et al., 2016) environments, grouped into the Atari (Bellemare et al., 2013), Classic Control, and Mujoco (Todorov et al., 2012) categories. We denote the full list of environments and their respective action space types in Appendix A, and we list the search spaces for the hyperparameters of each RL algorithm in Table 1. |
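
The Pseudocode row above names Algorithm 1 (Gray-Box HPO for RL), and the Software Dependencies row notes that the paper builds on GPyTorch. The snippet below is only a minimal sketch of the general gray-box idea behind such an algorithm: fit a GP surrogate on partially observed reward curves, i.e. on (hyperparameters, budget) pairs, and pick the next pair to evaluate with a UCB-style acquisition. It is not the authors' RCGP implementation; scikit-learn's GaussianProcessRegressor stands in for the paper's GPyTorch model, and the toy search space, the `observe_reward` stub, and all constants are hypothetical.

```python
# Minimal gray-box HPO sketch: a GP over (hyperparameters, budget) -> reward.
# Simplified stand-in for the paper's RCGP / Algorithm 1, not the authors' code.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical discrete search space (the real spaces are listed in Table 1 of the paper).
configs = np.array([[lr, gamma] for lr in (1e-4, 1e-3, 1e-2) for gamma in (0.95, 0.99)])
budgets = np.arange(1, 11)  # partial-training budgets, e.g. reward-curve checkpoints

def observe_reward(config, budget):
    """Stub for running or looking up partial training; returns a synthetic reward."""
    lr, gamma = config
    return float(-np.log10(lr) + gamma * budget + rng.normal(0.0, 0.1))

# Observation history: rows are [lr, gamma, budget]; targets are observed rewards.
X_obs, y_obs = [], []
for cfg in configs:  # warm-start with one cheap (smallest-budget) observation each
    X_obs.append([cfg[0], cfg[1], budgets[0]])
    y_obs.append(observe_reward(cfg, budgets[0]))

for step in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3, normalize_y=True)
    gp.fit(np.array(X_obs), np.array(y_obs))

    # Score every (config, budget) candidate with a UCB acquisition and query the best.
    cand = np.array([[cfg[0], cfg[1], b] for cfg in configs for b in budgets])
    mu, sigma = gp.predict(cand, return_std=True)
    nxt = cand[np.argmax(mu + 2.0 * sigma)]
    X_obs.append(nxt.tolist())
    y_obs.append(observe_reward(nxt[:2], nxt[2]))

best = int(np.argmax(y_obs))
print("Incumbent (lr, gamma, budget):", X_obs[best], "reward:", y_obs[best])
```

In the paper's setting, `observe_reward` would return values read from the AutoRL-Bench reward curves rather than a synthetic function, and the surrogate would be the paper's reward-curve GP rather than a generic Matern GP.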
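
Similarly, the Open Datasets row describes AutoRL-Bench as a tabular benchmark storing reward curves for three random seeds per hyperparameter configuration. The sketch below shows only the rough shape such a lookup could take (a table keyed by algorithm, environment, and configuration, with per-seed curves averaged at a given budget); the actual AutoRL-Bench API and file format may differ, and every value shown is a made-up placeholder.

```python
# Sketch of a tabular-benchmark lookup in the spirit of AutoRL-Bench
# (hypothetical structure; the real benchmark's API and format may differ).
from statistics import mean

# table[(algorithm, environment, config)][seed] -> reward curve (one reward per checkpoint)
table = {
    ("PPO", "CartPole-v1", (1e-3, 0.99)): {
        0: [20.0, 90.0, 150.0],   # placeholder curves, not real benchmark values
        1: [18.0, 85.0, 160.0],
        2: [22.0, 95.0, 155.0],
    },
}

def query(algorithm, environment, config, budget):
    """Mean reward across seeds after `budget` checkpoints of training."""
    curves = table[(algorithm, environment, config)]
    return mean(curve[budget - 1] for curve in curves.values())

print(query("PPO", "CartPole-v1", (1e-3, 0.99), budget=2))  # -> 90.0
```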