Evolving Reinforcement Learning Algorithms

Authors: John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Quoc V. Le, Sergey Levine, Honglak Lee, Aleksandra Faust

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare starting from scratch with bootstrapping off existing algorithms and find that while starting from scratch can learn existing algorithms, starting from existing knowledge leads to new RL algorithms which can outperform the initial programs. We learn two new RL algorithms which outperform existing algorithms in both sample efficiency and final performance on the training and test environments. The learned algorithms are domain agnostic and generalize to new environments. Importantly, the training environments consist of a suite of discrete action classical control tasks and gridworld style environments, while the test environments include Atari games and are unlike anything seen during training. We discuss the training setup and results of our experiments.
Researcher Affiliation | Collaboration | John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, Aleksandra Faust; Research at Google, Mountain View, CA 94043, USA; jcoreyes@eecs.berkeley.edu, {yingjiemiao,daiyip,ereal,slevine,qvl,honglak,sandrafaust}@google.com
Pseudocode | Yes | Algorithm 1 (Algorithm Evaluation, Eval(L, E)) and Algorithm 2 (Evolving RL Algorithms); a hedged sketch of this evaluate-and-evolve loop appears below the table.
Open Source Code | Yes | We provide a full list of top performing algorithms from a few of our experiments at https://github.com/jcoreyes/evolvingrl.
Open Datasets | Yes | We use a range of 4 classical control tasks (CartPole, Acrobot, MountainCar, LunarLander) and a set of 12 multitask gridworld-style environments from MiniGrid (Chevalier-Boisvert et al., 2018): Minimalistic gridworld environment for OpenAI Gym, https://github.com/maximecb/gym-minigrid, 2018. An environment-setup sketch appears below the table.
Dataset Splits | No | The paper describes training and test environments but does not explicitly define or refer to a distinct validation set split (e.g., in terms of percentages or counts) for hyperparameter tuning or model selection.
Hardware Specification | No | The paper states, "The search is done over 300 CPUs and run for roughly 72 hours," but does not specify the model or type of CPUs, nor any other hardware components like GPUs or memory details.
Software Dependencies | No | The paper mentions using the "Adam optimizer" and "ReLU activations" but does not provide specific version numbers for any software dependencies, libraries, or frameworks (e.g., Python version, TensorFlow/PyTorch version, CUDA version).
Experiment Setup | Yes | For training the RL agent, we use the same hyperparameters across all training and test environments except as noted. All neural networks are MLPs of size (256, 256) with ReLU activations. We use the Adam optimizer with a learning rate of 0.0001. ϵ is decayed linearly from 1 to 0.05 over 1e3 steps for the classical control tasks and over 1e5 steps for the MiniGrid tasks. Target update period is 100. A configuration sketch collecting these values appears below the table.
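
The Pseudocode row above points to Algorithm 1 (Algorithm Evaluation, Eval(L, E)) and Algorithm 2 (Evolving RL Algorithms). The sketch below shows only the evaluate-and-evolve control flow, assuming a regularized-evolution search over candidate loss programs; `eval_program`, `mutate`, the toy stand-ins, and the default population and tournament sizes are illustrative placeholders, not the authors' released code or settings.

```python
import random

def evolve(initial_programs, eval_program, mutate,
           population_size=300, tournament_size=25, num_cycles=1000):
    """Regularized-evolution sketch of Algorithm 2 (Evolving RL Algorithms).

    eval_program(program) stands in for Algorithm 1, Eval(L, E): train an agent
    with the candidate loss program L on the training environments E and return
    an aggregate normalized score.
    """
    # Score the bootstrap programs (e.g., an existing algorithm such as DQN).
    population = [(p, eval_program(p)) for p in initial_programs]
    history = list(population)
    for _ in range(num_cycles):
        # Tournament selection: mutate the fittest member of a random subset.
        tournament = random.sample(population, min(tournament_size, len(population)))
        parent, _ = max(tournament, key=lambda item: item[1])
        child = mutate(parent)
        population.append((child, eval_program(child)))
        history.append(population[-1])
        if len(population) > population_size:
            population.pop(0)  # Age out the oldest member (regularized evolution).
    return max(history, key=lambda item: item[1])

if __name__ == "__main__":
    # Toy usage with stand-in programs and a random evaluator, purely to show
    # the control flow; the real search scores graph-based loss functions by
    # training agents on the training environments.
    toy_eval = lambda program: random.random()
    toy_mutate = lambda program: program + "'"
    best_program, best_score = evolve(["dqn_loss"], toy_eval, toy_mutate,
                                      population_size=20, tournament_size=5,
                                      num_cycles=100)
    print(best_program, best_score)
```

Removing the oldest member rather than the worst is what distinguishes regularized (aging) evolution from plain tournament selection.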
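
The Open Datasets row names the training environments. A minimal setup sketch follows, assuming the classic `gym` and `gym-minigrid` packages; the specific environment IDs and version suffixes are assumptions, since the row names the tasks but not the registered IDs the authors used, and only two of the 12 MiniGrid-style tasks are listed as examples.

```python
import gym
import gym_minigrid  # registers the MiniGrid-* environments with gym

# Assumed environment IDs for the two training suites named in the
# Open Datasets row.
CLASSICAL_CONTROL_IDS = [
    "CartPole-v0",
    "Acrobot-v1",
    "MountainCar-v0",
    "LunarLander-v2",
]
MINIGRID_IDS = [
    "MiniGrid-Empty-5x5-v0",
    "MiniGrid-DoorKey-5x5-v0",
    # The paper uses 12 MiniGrid-style tasks in total; only two examples here.
]

def make_training_envs():
    """Instantiate every training environment in both suites."""
    return [gym.make(env_id) for env_id in CLASSICAL_CONTROL_IDS + MINIGRID_IDS]

if __name__ == "__main__":
    for env in make_training_envs():
        env.reset()
        print(env.spec.id, env.action_space)
```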
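
The Experiment Setup row quotes the agent hyperparameters. The configuration sketch below collects those values in code, assuming PyTorch and a discrete-action Q-learning setup purely for illustration; the framework, `obs_dim`, and `num_actions` are not specified in the quoted text.

```python
import torch
import torch.nn as nn

def make_q_network(obs_dim: int, num_actions: int) -> nn.Module:
    # "All neural networks are MLPs of size (256, 256) with ReLU activations."
    return nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, num_actions),
    )

def epsilon(step: int, decay_steps: int) -> float:
    # Linear decay from 1.0 to 0.05; decay_steps is 1e3 for the classical
    # control tasks and 1e5 for the MiniGrid tasks.
    frac = min(step / decay_steps, 1.0)
    return 1.0 + frac * (0.05 - 1.0)

# obs_dim=4 and num_actions=2 are placeholder shapes (e.g., CartPole).
q_net = make_q_network(obs_dim=4, num_actions=2)
target_net = make_q_network(obs_dim=4, num_actions=2)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
TARGET_UPDATE_PERIOD = 100  # steps between target-network syncs
```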