Sample-Efficient Automated Deep Reinforcement Learning

Authors: Jörg K.H. Franke, Gregor Koehler, André Biedenkapp, Frank Hutter

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm in the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training. In a case study optimizing the popular TD3 algorithm (Fujimoto et al., 2018) in the MuJoCo benchmark suite we demonstrate the benefits of our framework and provide extensive ablation and analytic experiments. SEARL outperforms random search and modified PBT in terms of environment interactions, with up to 10× faster convergence in each task, which indicates a significant gain in the sample efficiency of our approach.
Researcher Affiliation | Collaboration | Jörg K.H. Franke (1), Gregor Köhler (2), André Biedenkapp (1) & Frank Hutter (1,3); (1) Department of Computer Science, University of Freiburg, Germany; (2) German Cancer Research Center, Heidelberg, Germany; (3) Bosch Center for Artificial Intelligence, Renningen, Germany; frankej@cs.uni-freiburg.de
Pseudocode | Yes | We refer the reader to Appendix B for detailed pseudocode of the algorithm. (Appendix B, Algorithm 1: SEARL algorithm; see the illustrative loop sketch after the table.)
Open Source Code | Yes | We provide an open-source implementation of SEARL. Please find the source code on GitHub: github.com/automl/SEARL
Open Datasets | Yes | We simulate the meta-optimization of the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3; Fujimoto et al., 2018) on the widely used MuJoCo continuous control benchmark suite (Todorov et al., 2012). We also demonstrate the generalization capabilities of our approach by meta-optimizing the established DQN (Mnih et al., 2015) algorithm for the Atari benchmark. We use SEARL to automatically tune the widely used Rainbow DQN algorithm (Hessel et al., 2018) on five different environments of the Arcade Learning Environment (Bellemare et al., 2013). (See the environment sketch after the table.)
Dataset Splits | No | We select the configuration that resulted in the best training performance and evaluate it with 10 different random seeds to obtain its validation performance. This statement refers to evaluating the best configuration found by random search, not a specific validation dataset split used during model training (e.g., a percentage of data set aside for validation during hyperparameter tuning of a single model). The paper does not provide explicit dataset split percentages or sample counts for validation. (See the seed-evaluation sketch after the table.)
Hardware Specification | No | The paper discusses computational costs and distributed setups but does not specify particular hardware components such as GPU or CPU models, or specific machine configurations used for the experiments.
Software Dependencies | No | The paper mentions algorithms such as TD3, DQN, Rainbow DQN, and the Adam optimizer, but does not provide version numbers for any software libraries or dependencies (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | Table 2 shows the configuration of SEARL for the TD3 case study: max. frames in environment: 2 000 000; replay memory size: 1 000 000; min. frames per evaluation: 250; population size: 20; selection tournament size: 3; new layer probability: 20%; new nodes probability: 80%; parameter noise standard deviation: 0.1; training batch size: 100; fraction of eval frames for training: 0.5; default learning rate: 0.001; optimizer: Adam; TD3 gamma: 0.99; TD3 tau: 0.005; TD3 policy noise: 0.2; TD3 noise clip: 0.5; TD3 update frequency: 2; default activation function: ReLU; start network size: [128]. (These values are collected in the configuration sketch after the table.)
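
As a rough companion to the Appendix B pseudocode referenced in the Pseudocode row, the following is a minimal Python sketch of a SEARL-style loop: evaluation that doubles as experience collection into a shared replay memory, tournament selection, mutation, and training. All class and method names (agent.evaluate, agent.mutate, agent.train) are hypothetical placeholders under assumed interfaces, not the authors' implementation.

```python
import random

def searl_loop(population, env_fn, shared_memory, max_frames,
               tournament_size=3, min_eval_frames=250):
    """Minimal sketch of a SEARL-style loop (hypothetical agent interface):
    evaluate -> select -> mutate -> train, reusing evaluation rollouts as
    training data via a shared replay memory."""
    frames = 0
    while frames < max_frames:
        # 1) Evaluate every member; the transitions gathered during
        #    evaluation are pushed into the shared replay memory.
        fitness = []
        for agent in population:
            episode_return, transitions = agent.evaluate(env_fn(), min_eval_frames)
            shared_memory.extend(transitions)
            frames += len(transitions)
            fitness.append(episode_return)

        # 2) Tournament selection: fill the next generation by repeatedly
        #    keeping the fittest member of a small random subset.
        next_population = []
        for _ in range(len(population)):
            contenders = random.sample(range(len(population)), tournament_size)
            winner = max(contenders, key=lambda i: fitness[i])
            next_population.append(population[winner].copy())

        # 3) Mutate hyperparameters / architectures, then train each agent
        #    on batches drawn from the shared replay memory.
        population = [agent.mutate() for agent in next_population]
        for agent in population:
            agent.train(shared_memory)
    return population
```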
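
The MuJoCo and Atari benchmarks quoted in the Open Datasets row are typically accessed through the Gym / Arcade Learning Environment interface; the snippet below is a minimal sketch of that common usage. The concrete environment IDs and the (pre-0.26) Gym API are assumptions about typical setups, not details given in the paper.

```python
import gym

# Illustrative environment choices from the two benchmark families mentioned
# in the paper; these specific IDs are common Gym registrations, not a list
# taken from the paper.
mujoco_env = gym.make("HalfCheetah-v2")      # MuJoCo continuous control (TD3 case study)
atari_env = gym.make("PongNoFrameskip-v4")   # Arcade Learning Environment (DQN / Rainbow)

# Classic Gym interaction loop (old-style API where step returns a 4-tuple).
obs = mujoco_env.reset()
done = False
while not done:
    action = mujoco_env.action_space.sample()
    obs, reward, done, info = mujoco_env.step(action)
```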
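
The validation protocol quoted in the Dataset Splits row, re-running the best-found configuration with 10 random seeds, could look roughly like this sketch; evaluate_config and best_config are hypothetical helpers, not the authors' code.

```python
import numpy as np

def validate(best_config, evaluate_config, n_seeds=10):
    """Re-evaluate the best configuration across independent random seeds,
    as in the quoted protocol (hypothetical helper names)."""
    returns = [evaluate_config(best_config, seed=seed) for seed in range(n_seeds)]
    return float(np.mean(returns)), float(np.std(returns))
```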
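
For convenience, the Table 2 values quoted in the Experiment Setup row can be collected in a single configuration object; the dataclass below simply transcribes those values and is not part of the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class SEARLTD3Config:
    """SEARL configuration for the TD3 case study, transcribed from Table 2."""
    max_frames: int = 2_000_000          # max. frames in environment
    replay_memory_size: int = 1_000_000
    min_eval_frames: int = 250           # min. frames per evaluation
    population_size: int = 20
    tournament_size: int = 3
    new_layer_prob: float = 0.2          # probability of adding a new layer
    new_nodes_prob: float = 0.8          # probability of adding new nodes
    parameter_noise_std: float = 0.1
    batch_size: int = 100
    train_frames_fraction: float = 0.5   # fraction of eval frames used for training
    learning_rate: float = 1e-3
    optimizer: str = "Adam"
    gamma: float = 0.99                  # TD3 discount factor
    tau: float = 0.005                   # TD3 target-network update rate
    policy_noise: float = 0.2
    noise_clip: float = 0.5
    policy_update_frequency: int = 2
    activation: str = "relu"
    start_network_size: tuple = (128,)
```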