Sample-Efficient Automated Deep Reinforcement Learning

Authors: Jörg K.H. Franke, Gregor Koehler, André Biedenkapp, Frank Hutter

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm in the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training. In a case study optimizing the popular TD3 algorithm (Fujimoto et al., 2018) in the MuJoCo benchmark suite we demonstrate the benefits of our framework and provide extensive ablation and analytic experiments. SEARL outperforms random search and modified PBT in terms of environment interactions, with up to 10× faster convergence in each task, which indicates a significant gain in the sample efficiency of our approach.
Researcher Affiliation | Collaboration | Jörg K.H. Franke (1), Gregor Köhler (2), André Biedenkapp (1) & Frank Hutter (1,3); (1) Department of Computer Science, University of Freiburg, Germany; (2) German Cancer Research Center, Heidelberg, Germany; (3) Bosch Center for Artificial Intelligence, Renningen, Germany; frankej@cs.uni-freiburg.de
Pseudocode | Yes | We refer the reader to Appendix B for detailed pseudocode of the algorithm. (Appendix B, Algorithm 1: SEARL algorithm; see the illustrative loop sketch after the table.)
Open Source Code | Yes | We provide an open-source implementation of SEARL. Please find the source code on GitHub: github.com/automl/SEARL
Open Datasets | Yes | We simulate the meta-optimization of the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3; Fujimoto et al., 2018) on the widely used MuJoCo continuous control benchmark suite (Todorov et al., 2012). We also demonstrate the generalization capabilities of our approach by meta-optimizing the established DQN (Mnih et al., 2015) algorithm for the Atari benchmark. We use SEARL to automatically tune the widely used Rainbow DQN algorithm (Hessel et al., 2018) on five different environments of the Arcade Learning Environment (Bellemare et al., 2013). (See the environment sketch after the table.)
Dataset Splits | No | We select the configuration that resulted in the best training performance and evaluate it with 10 different random seeds to obtain its validation performance. This statement refers to evaluating the best configuration found by random search, not a specific validation dataset split used during model training (e.g., a percentage of data set aside for validation during hyperparameter tuning of a single model). The paper does not provide explicit dataset split percentages or sample counts for validation. (See the seed-evaluation sketch after the table.)
Hardware Specification | No | The paper discusses computational costs and distributed setups but does not specify particular hardware components such as GPU or CPU models, or specific machine configurations used for the experiments.
Software Dependencies | No | The paper mentions algorithms such as TD3, DQN, Rainbow DQN, and the Adam optimizer, but does not provide version numbers for any software libraries or dependencies (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | Table 2 shows the configuration of SEARL for the TD3 case study: max. frames in environment: 2 000 000; replay memory size: 1 000 000; min. frames per evaluation: 250; population size: 20; selection tournament size: 3; new layer probability: 20%; new nodes probability: 80%; parameter noise standard deviation: 0.1; training batch size: 100; fraction of eval frames for training: 0.5; default learning rate: 0.001; optimizer: Adam; TD3 gamma: 0.99; TD3 tau: 0.005; TD3 policy noise: 0.2; TD3 noise clip: 0.5; TD3 update frequency: 2; default activation function: ReLU; start network size: [128]. (These values are collected in the configuration sketch after the table.)
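
As a rough companion to the Appendix B pseudocode referenced in the Pseudocode row, the following is a minimal Python sketch of a SEARL-style loop: evaluation that doubles as experience collection into a shared replay memory, tournament selection, mutation, and training. All class and method names (agent.evaluate, agent.mutate, agent.train) are hypothetical placeholders under assumed interfaces, not the authors' implementation.

```python
import random

def searl_loop(population, env_fn, shared_memory, max_frames,
               tournament_size=3, min_eval_frames=250):
    """Minimal sketch of a SEARL-style loop (hypothetical agent interface):
    evaluate -> select -> mutate -> train, reusing evaluation rollouts as
    training data via a shared replay memory."""
    frames = 0
    while frames < max_frames:
        # 1) Evaluate every member; the transitions gathered during
        #    evaluation are pushed into the shared replay memory.
        fitness = []
        for agent in population:
            episode_return, transitions = agent.evaluate(env_fn(), min_eval_frames)
            shared_memory.extend(transitions)
            frames += len(transitions)
            fitness.append(episode_return)

        # 2) Tournament selection: fill the next generation by repeatedly
        #    keeping the fittest member of a small random subset.
        next_population = []
        for _ in range(len(population)):
            contenders = random.sample(range(len(population)), tournament_size)
            winner = max(contenders, key=lambda i: fitness[i])
            next_population.append(population[winner].copy())

        # 3) Mutate hyperparameters / architectures, then train each agent
        #    on batches drawn from the shared replay memory.
        population = [agent.mutate() for agent in next_population]
        for agent in population:
            agent.train(shared_memory)
    return population
```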
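
The MuJoCo and Atari benchmarks quoted in the Open Datasets row are typically accessed through the Gym / Arcade Learning Environment interface; the snippet below is a minimal sketch of that common usage. The concrete environment IDs and the (pre-0.26) Gym API are assumptions about typical setups, not details given in the paper.

```python
import gym

# Illustrative environment choices from the two benchmark families mentioned
# in the paper; these specific IDs are common Gym registrations, not a list
# taken from the paper.
mujoco_env = gym.make("HalfCheetah-v2")      # MuJoCo continuous control (TD3 case study)
atari_env = gym.make("PongNoFrameskip-v4")   # Arcade Learning Environment (DQN / Rainbow)

# Classic Gym interaction loop (old-style API where step returns a 4-tuple).
obs = mujoco_env.reset()
done = False
while not done:
    action = mujoco_env.action_space.sample()
    obs, reward, done, info = mujoco_env.step(action)
```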
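
The validation protocol quoted in the Dataset Splits row, re-running the best-found configuration with 10 random seeds, could look roughly like this sketch; evaluate_config and best_config are hypothetical helpers, not the authors' code.

```python
import numpy as np

def validate(best_config, evaluate_config, n_seeds=10):
    """Re-evaluate the best configuration across independent random seeds,
    as in the quoted protocol (hypothetical helper names)."""
    returns = [evaluate_config(best_config, seed=seed) for seed in range(n_seeds)]
    return float(np.mean(returns)), float(np.std(returns))
```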
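
For convenience, the Table 2 values quoted in the Experiment Setup row can be collected in a single configuration object; the dataclass below simply transcribes those values and is not part of the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class SEARLTD3Config:
    """SEARL configuration for the TD3 case study, transcribed from Table 2."""
    max_frames: int = 2_000_000          # max. frames in environment
    replay_memory_size: int = 1_000_000
    min_eval_frames: int = 250           # min. frames per evaluation
    population_size: int = 20
    tournament_size: int = 3
    new_layer_prob: float = 0.2          # probability of adding a new layer
    new_nodes_prob: float = 0.8          # probability of adding new nodes
    parameter_noise_std: float = 0.1
    batch_size: int = 100
    train_frames_fraction: float = 0.5   # fraction of eval frames used for training
    learning_rate: float = 1e-3
    optimizer: str = "Adam"
    gamma: float = 0.99                  # TD3 discount factor
    tau: float = 0.005                   # TD3 target-network update rate
    policy_noise: float = 0.2
    noise_clip: float = 0.5
    policy_update_frequency: int = 2
    activation: str = "relu"
    start_network_size: tuple = (128,)
```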