Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits

Authors: Jack Parker-Holder, Vu Nguyen, Stephen J. Roberts

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show in a series of RL experiments that PB2 is able to achieve high performance with a modest computational budget.
Researcher Affiliation | Academia | Jack Parker-Holder (University of Oxford, jackph@robots.ox.ac.uk), Vu Nguyen (University of Oxford, vu@robots.ox.ac.uk), Stephen J. Roberts (University of Oxford, sjrob@robots.ox.ac.uk)
Pseudocode | Yes | Algorithm 1: Population-Based Bandit Optimization (PB2); a minimal sketch of this loop is given after the table.
Open Source Code | Yes | See code here: https://github.com/jparkerholder/PB2
Open Datasets | Yes | We consider optimizing a policy for continuous control problems from the OpenAI Gym [11]. In particular, we seek to optimize the hyperparameters for Proximal Policy Optimization (PPO, [58]) for the following tasks: Bipedal Walker, Lunar Lander Continuous, Hopper and Inverted Double Pendulum.
Dataset Splits | No | The paper does not explicitly state training, validation, and test dataset splits with percentages or counts. It mentions running experiments with "ten seeds" and evaluating the "median best performing agent", but gives no details on data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions that the algorithm "can be run locally on most modern computers" and notes the use of the 'tune' library.
Software Dependencies | No | The paper mentions the use of the 'tune' library [43, 42] and 'GPy' [23] but does not specify their version numbers.
Experiment Setup | Yes | During training, we optimize the following hyperparameters: batch size, learning rate, GAE parameter (λ, [57]) and PPO clip parameter (ε). We use the same fixed ranges across all four environments (included in Appendix Section 8). All experiments are conducted for 10^6 environment timesteps, with t_ready triggered every 5 × 10^4 timesteps. For BO, we train each agent sequentially for 500k steps and select the best to train for the remaining budget. For ASHA, we initialize a population of 18 agents to compare against B = 4, and 48 agents for B = 8. These were chosen to achieve the same total budget, with the grace period equal to the t_ready criterion for PBT and PB2. Given the typically noisy evaluation of policy gradient algorithms [24], we repeat each experiment with ten seeds. An illustrative launch sketch for this setup is also given after the table.
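
The Algorithm 1 row above refers to the paper's pseudocode for PB2, which keeps the exploit step of Population Based Training but replaces its random hyperparameter perturbation with a GP-bandit suggestion. Below is a minimal sketch of that loop, assuming a dict-based population; the callables train_step, evaluate and suggest_hyperparams are hypothetical placeholders (the last standing in for the time-varying GP-UCB acquisition fit to observed reward improvements), not the authors' implementation.

import copy
import random

def pb2_loop(population, t_ready, total_steps,
             train_step, evaluate, suggest_hyperparams):
    """Sketch of one PB2 run.

    population: list of dicts with 'weights', 'hyperparams' and an initial 'score'.
    train_step(agent, num_steps): trains the agent in place for num_steps timesteps.
    evaluate(agent): returns the agent's current reward.
    suggest_hyperparams(history, t): GP-bandit step returning new hyperparameters
        by maximising a time-varying UCB acquisition over the observed data.
    """
    history = []  # (timestep, hyperparams, reward improvement) observations
    for t in range(0, total_steps, t_ready):
        for agent in population:
            prev = agent["score"]
            train_step(agent, num_steps=t_ready)      # partial training
            agent["score"] = evaluate(agent)
            history.append((t, agent["hyperparams"], agent["score"] - prev))
        ranked = sorted(population, key=lambda a: a["score"])
        cutoff = max(1, len(population) // 4)
        for weak in ranked[:cutoff]:
            strong = random.choice(ranked[-cutoff:])
            # exploit: copy the weights of a top-performing agent
            weak["weights"] = copy.deepcopy(strong["weights"])
            # explore: unlike PBT's random perturbation, PB2 chooses the next
            # hyperparameters with a GP bandit fit to the observed improvements
            weak["hyperparams"] = suggest_hyperparams(history, t)
    return max(population, key=lambda a: a["score"])

The design difference from PBT is that exploration is data-driven: each new configuration is selected by an acquisition function over the reward-improvement observations, which is what underpins the paper's efficiency guarantee.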
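
Since the experiments are run through the 'tune' library, the experiment setup row above roughly corresponds to a launch script along the following lines. This is a hedged sketch only: the hyperparameter bounds shown are illustrative placeholders rather than the ranges from Appendix Section 8, and argument names may differ across Ray Tune versions.

# Illustrative PB2 launch with Ray Tune; bounds below are placeholders,
# not the paper's ranges from Appendix Section 8.
from ray import tune
from ray.tune.schedulers.pb2 import PB2

pb2 = PB2(
    time_attr="timesteps_total",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50_000,         # t_ready triggered every 5 * 10^4 timesteps
    hyperparam_bounds={                   # illustrative bounds only
        "lambda": [0.9, 1.0],             # GAE parameter
        "clip_param": [0.1, 0.5],         # PPO clip parameter
        "lr": [1e-5, 1e-3],               # learning rate
        "train_batch_size": [1_000, 60_000],
    },
)

analysis = tune.run(
    "PPO",
    name="pb2_bipedalwalker",
    scheduler=pb2,
    num_samples=4,                        # population size B = 4
    stop={"timesteps_total": 1_000_000},  # 10^6 environment timesteps
    config={
        "env": "BipedalWalker-v3",        # env id depends on the installed gym version
        "lambda": 0.95,
        "clip_param": 0.2,
        "lr": 1e-4,
        "train_batch_size": 10_000,
    },
)

Setting num_samples to 8, with everything else unchanged, would correspond to the B = 8 population; repeating the run under different seeds matches the ten-seed protocol quoted above.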