Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits
Authors: Jack Parker-Holder, Vu Nguyen, Stephen J. Roberts
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show in a series of RL experiments that PB2 is able to achieve high performance with a modest computational budget. Furthermore, we show in a series of RL experiments that PB2 is able to achieve high rewards with a modest computational budget. |
| Researcher Affiliation | Academia | Jack Parker-Holder, University of Oxford (jackph@robots.ox.ac.uk); Vu Nguyen, University of Oxford (vu@robots.ox.ac.uk); Stephen J. Roberts, University of Oxford (sjrob@robots.ox.ac.uk) |
| Pseudocode | Yes | Algorithm 1: Population-Based Bandit Optimization (PB2) |
| Open Source Code | Yes | See code here: https://github.com/jparkerholder/PB2. |
| Open Datasets | Yes | We consider optimizing a policy for continuous control problems from the OpenAI Gym [11]. In particular, we seek to optimize the hyperparameters for Proximal Policy Optimization (PPO, [58]) for the following tasks: Bipedal Walker, Lunar Lander Continuous, Hopper and Inverted Double Pendulum. |
| Dataset Splits | No | The paper does not explicitly state training, validation, and test dataset splits with percentages or counts. It mentions running experiments with "ten seeds" and evaluating "median best performing agent" but no details on data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions that the algorithm "can be run locally on most modern computers" and notes the use of the 'tune' library. |
| Software Dependencies | No | The paper mentions the use of the 'tune' library [43, 42] and 'GPy' [23] but does not specify their version numbers. |
| Experiment Setup | Yes | During training, we optimize the following hyperparameters: batch size, learning rate, GAE parameter (λ, [57]) and PPO clip parameter (ε). We use the same fixed ranges across all four environments (included in the Appendix Section 8). All experiments are conducted for 10^6 environment timesteps, with the t_ready command triggered every 5 × 10^4 timesteps. For BO, we train each agent sequentially for 500k steps, and select the best to train for the remaining budget. For ASHA, we initialize a population of 18 agents to compare against B = 4 and 48 agents for B = 8. These were chosen to achieve the same total budget, with the grace period equal to the t_ready criterion for PBT and PB2. Given the typically noisy evaluation of policy gradient algorithms [24], we repeat each experiment with ten seeds. |
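The "Pseudocode" row above points to Algorithm 1 (PB2), which keeps the Population Based Training loop of train / exploit / explore but replaces the random explore step with a GP-bandit suggestion. The snippet below is a minimal sketch of such a GP-UCB explore step, included for illustration only: it uses scikit-learn's Gaussian process as a stand-in for the paper's GPy-based time-varying model, and the function name, candidate-sampling scheme and `kappa` value are assumptions rather than details taken from the paper.

```python
# Illustrative sketch only -- not the authors' implementation (which uses GPy
# and a time-varying kernel). Given the hyperparameters tried so far (X) and
# the reward improvements they produced (y), suggest the next configuration
# for an exploring agent by maximizing a GP-UCB acquisition over random
# candidates drawn inside the search bounds.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def ucb_suggest(X, y, bounds, kappa=2.0, n_candidates=1000, seed=0):
    rng = np.random.default_rng(seed)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(X), np.asarray(y))
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    candidates = rng.uniform(lo, hi, size=(n_candidates, len(bounds)))
    mean, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mean + kappa * std)]  # upper confidence bound

# Example: suggest (GAE lambda, clip parameter, learning rate, batch size).
X = [[0.95, 0.2, 3e-4, 4096], [0.99, 0.3, 1e-4, 8192]]
y = [10.0, 25.0]
bounds = [(0.9, 1.0), (0.1, 0.5), (1e-5, 1e-3), (1000, 60000)]
print(ucb_suggest(X, y, bounds))
```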
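On the "Software Dependencies" and "Experiment Setup" rows: Ray Tune ships a PB2 scheduler (built on GPy and scikit-learn), so a setup along the lines described could look roughly like the sketch below. This is an assumption-laden illustration rather than the authors' script: the bound values, the "-v3" environment suffix and the RLlib config keys are guesses for a recent Ray/RLlib release, while the paper's exact ranges are given in its Appendix Section 8.

```python
# Rough sketch of the reported setup using Ray Tune's PB2 scheduler.
# Bounds, environment version and config keys are illustrative assumptions.
from ray import tune
from ray.tune.schedulers.pb2 import PB2

pb2 = PB2(
    time_attr="timesteps_total",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50_000,        # t_ready triggered every 5 * 10^4 timesteps
    hyperparam_bounds={
        "lambda": [0.9, 1.0],            # GAE parameter
        "clip_param": [0.1, 0.5],        # PPO clip parameter
        "lr": [1e-5, 1e-3],              # learning rate
        "train_batch_size": [1_000, 60_000],
    },
)

analysis = tune.run(
    "PPO",                                # RLlib's PPO trainable
    name="pb2_bipedalwalker",
    scheduler=pb2,
    num_samples=4,                        # population size B = 4 (use 8 for B = 8)
    stop={"timesteps_total": 1_000_000},  # 10^6 environment timesteps
    config={
        "env": "BipedalWalker-v3",
        "lambda": tune.uniform(0.9, 1.0),
        "clip_param": tune.uniform(0.1, 0.5),
        "lr": tune.uniform(1e-5, 1e-3),
        "train_batch_size": tune.randint(1_000, 60_000),
    },
)
```

Under these assumptions, each of the ten reported seeds would correspond to a separate run of this kind, with only the population size changing between the B = 4 and B = 8 settings.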