Hyperparameters in Reinforcement Learning and How To Tune Them
Authors: Theresa Eimer, Marius Lindauer, Roberta Raileanu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support this by comparing multiple state-of-the-art HPO tools on a range of RL algorithms and environments to their hand-tuned counterparts, demonstrating that HPO approaches often have higher performance and lower compute overhead. |
| Researcher Affiliation | Collaboration | Theresa Eimer (Leibniz University Hannover), Marius Lindauer (Leibniz University Hannover), Roberta Raileanu (Meta AI). |
| Pseudocode | No | The paper describes various algorithms and methods (e.g., Random Search, DEHB, PBT variants) but does not include any pseudocode or clearly labeled algorithm blocks in its main content or appendices. |
| Open Source Code | Yes | In order to encourage the adoption of these practices, we provide plug-and-play implementations of the tuning algorithms used in this paper at https://github.com/facebookresearch/how-to-autorl. |
| Open Datasets | Yes | We use basic gym environments such as OpenAI's Pendulum and Acrobot (Brockman et al., 2016), gridworlds with an exploration component such as MiniGrid's Empty and DoorKey 5x5 (Chevalier-Boisvert et al., 2018), as well as robot locomotion tasks such as Brax's Ant, Halfcheetah and Humanoid (Freeman et al., 2021). |
| Dataset Splits | Yes | To give an idea of the reliability of both the tuning algorithm and the found configurations, we tune each setting 3 times across 5 seeds and test the best-found configuration on 10 unseen test seeds. |
| Hardware Specification | Yes | All of our experiments were run on a compute cluster with two Intel CPUs per node (these were used for the experiments in Section 4) and four different node configurations for GPU (used for the experiments in Section 5). These configurations are: 2 Pascal 130 GPUs, 2 Pascal 144 GPUs, 8 Volta 10 GPUs or 8 Volta 332. We ran the CPU experiments with 10GB of memory on single nodes and the GPU experiments with 10GB for Procgen and 40GB for Brax on a single GPU each. |
| Software Dependencies | No | The paper mentions software components like 'Stable Baselines3 implementations', 'Optuna implementation', 'Hydra sweepers', and 'ray' but does not provide specific version numbers for these or other key libraries used in their experimental setup. Appendix A includes a checklist with placeholders for software versions (e.g., 'a conda environment file with all package versions'), but the actual versions used are not explicitly stated in the paper's text. |
| Experiment Setup | Yes | For each environment, we sweep over 8 hyperparameters for DQN, 7 for SAC and 11 for PPO (for a full list, see Appendix E). We use a total budget of 10 full RL runs for all methods. |
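The tuning protocol quoted above (a budget of 10 full RL runs, each candidate configuration tuned across 5 seeds, and the best configuration re-evaluated on 10 unseen test seeds) can be sketched as a minimal random-search loop. This is an illustrative sketch only, not the paper's implementation: `train_and_evaluate` is a hypothetical stand-in for a full RL training run, and the toy hyperparameter space (learning rate and discount factor) is an assumption for demonstration.

```python
import random

def train_and_evaluate(config, seed):
    """Hypothetical stand-in for a full RL run; a real setup would train
    e.g. PPO with this config and seed and return its mean eval return."""
    rng = random.Random(seed)
    # Toy objective peaking near lr=3e-4, gamma=0.99 (illustrative only).
    score = -abs(config["lr"] - 3e-4) * 1e3 - abs(config["gamma"] - 0.99) * 10
    return score + rng.gauss(0, 0.05)  # seed-dependent noise

def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-5, -2),   # log-uniform learning rate
        "gamma": rng.uniform(0.9, 0.999),  # discount factor
    }

def random_search(budget=10, tuning_seeds=range(5), rng_seed=0):
    """Random search with a budget of `budget` full RL runs, scoring each
    configuration by its mean performance across the tuning seeds."""
    rng = random.Random(rng_seed)
    seeds = list(tuning_seeds)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = sample_config(rng)
        score = sum(train_and_evaluate(config, s) for s in seeds) / len(seeds)
        if score > best_score:
            best_config, best_score = config, score
    return best_config

best = random_search()
# Final evaluation on 10 unseen test seeds, mirroring the paper's protocol.
test_seeds = range(100, 110)
test_score = sum(train_and_evaluate(best, s) for s in test_seeds) / len(list(test_seeds))
```

The paper's stronger baselines (DEHB, PBT variants) replace the uniform sampling with adaptive scheduling, but share this outer structure of tuning on a fixed seed set and reporting on held-out test seeds.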