Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits

Authors: Jack Parker-Holder, Vu Nguyen, Stephen J. Roberts

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show in a series of RL experiments that PB2 is able to achieve high performance with a modest computational budget.
Researcher Affiliation | Academia | Jack Parker-Holder (University of Oxford, jackph@robots.ox.ac.uk), Vu Nguyen (University of Oxford, vu@robots.ox.ac.uk), Stephen J. Roberts (University of Oxford, sjrob@robots.ox.ac.uk)
Pseudocode | Yes | Algorithm 1: Population-Based Bandit Optimization (PB2); a minimal sketch of this loop is given after the table.
Open Source Code | Yes | See code here: https://github.com/jparkerholder/PB2
Open Datasets | Yes | We consider optimizing a policy for continuous control problems from the OpenAI Gym [11]. In particular, we seek to optimize the hyperparameters for Proximal Policy Optimization (PPO, [58]) for the following tasks: Bipedal Walker, Lunar Lander Continuous, Hopper and Inverted Double Pendulum.
Dataset Splits | No | The paper does not explicitly state training, validation, and test dataset splits with percentages or counts. It mentions running experiments with "ten seeds" and evaluating the "median best performing agent", but gives no details on data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions that the algorithm "can be run locally on most modern computers" and notes the use of the 'tune' library.
Software Dependencies | No | The paper mentions the use of the 'tune' library [43, 42] and 'GPy' [23] but does not specify their version numbers.
Experiment Setup | Yes | During training, we optimize the following hyperparameters: batch size, learning rate, GAE parameter (λ, [57]) and PPO clip parameter (ε). We use the same fixed ranges across all four environments (included in Appendix Section 8). All experiments are conducted for 10^6 environment timesteps, with t_ready triggered every 5 × 10^4 timesteps. For BO, we train each agent sequentially for 500k steps and select the best to train for the remaining budget. For ASHA, we initialize a population of 18 agents to compare against B = 4, and 48 agents for B = 8. These were chosen to achieve the same total budget, with the grace period equal to the t_ready criterion for PBT and PB2. Given the typically noisy evaluation of policy gradient algorithms [24], we repeat each experiment with ten seeds. An illustrative launch sketch for this setup is also given after the table.
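
The Algorithm 1 row above refers to the paper's pseudocode for PB2, which keeps the exploit step of Population Based Training but replaces its random hyperparameter perturbation with a GP-bandit suggestion. Below is a minimal sketch of that loop, assuming a dict-based population; the callables train_step, evaluate and suggest_hyperparams are hypothetical placeholders (the last standing in for the time-varying GP-UCB acquisition fit to observed reward improvements), not the authors' implementation.

import copy
import random

def pb2_loop(population, t_ready, total_steps,
             train_step, evaluate, suggest_hyperparams):
    """Sketch of one PB2 run.

    population: list of dicts with 'weights', 'hyperparams' and an initial 'score'.
    train_step(agent, num_steps): trains the agent in place for num_steps timesteps.
    evaluate(agent): returns the agent's current reward.
    suggest_hyperparams(history, t): GP-bandit step returning new hyperparameters
        by maximising a time-varying UCB acquisition over the observed data.
    """
    history = []  # (timestep, hyperparams, reward improvement) observations
    for t in range(0, total_steps, t_ready):
        for agent in population:
            prev = agent["score"]
            train_step(agent, num_steps=t_ready)      # partial training
            agent["score"] = evaluate(agent)
            history.append((t, agent["hyperparams"], agent["score"] - prev))
        ranked = sorted(population, key=lambda a: a["score"])
        cutoff = max(1, len(population) // 4)
        for weak in ranked[:cutoff]:
            strong = random.choice(ranked[-cutoff:])
            # exploit: copy the weights of a top-performing agent
            weak["weights"] = copy.deepcopy(strong["weights"])
            # explore: unlike PBT's random perturbation, PB2 chooses the next
            # hyperparameters with a GP bandit fit to the observed improvements
            weak["hyperparams"] = suggest_hyperparams(history, t)
    return max(population, key=lambda a: a["score"])

The design difference from PBT is that exploration is data-driven: each new configuration is selected by an acquisition function over the reward-improvement observations, which is what underpins the paper's efficiency guarantee.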
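
Since the experiments are run through the 'tune' library, the experiment setup row above roughly corresponds to a launch script along the following lines. This is a hedged sketch only: the hyperparameter bounds shown are illustrative placeholders rather than the ranges from Appendix Section 8, and argument names may differ across Ray Tune versions.

# Illustrative PB2 launch with Ray Tune; bounds below are placeholders,
# not the paper's ranges from Appendix Section 8.
from ray import tune
from ray.tune.schedulers.pb2 import PB2

pb2 = PB2(
    time_attr="timesteps_total",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50_000,         # t_ready triggered every 5 * 10^4 timesteps
    hyperparam_bounds={                   # illustrative bounds only
        "lambda": [0.9, 1.0],             # GAE parameter
        "clip_param": [0.1, 0.5],         # PPO clip parameter
        "lr": [1e-5, 1e-3],               # learning rate
        "train_batch_size": [1_000, 60_000],
    },
)

analysis = tune.run(
    "PPO",
    name="pb2_bipedalwalker",
    scheduler=pb2,
    num_samples=4,                        # population size B = 4
    stop={"timesteps_total": 1_000_000},  # 10^6 environment timesteps
    config={
        "env": "BipedalWalker-v3",        # env id depends on the installed gym version
        "lambda": 0.95,
        "clip_param": 0.2,
        "lr": 1e-4,
        "train_batch_size": 10_000,
    },
)

Setting num_samples to 8, with everything else unchanged, would correspond to the B = 8 population; repeating the run under different seeds matches the ten-seed protocol quoted above.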