Dropout Q-Functions for Doubly Efficient Reinforcement Learning

Authors: Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, Yoshimasa Tsuruoka

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Despite its simplicity of implementation, our experimental results indicate that DroQ is doubly (sample and computationally) efficient. It achieved comparable sample efficiency with REDQ, much better computational efficiency than REDQ, and comparable computational efficiency with that of SAC.
Researcher Affiliation Collaboration Takuya Hiraoka (1,2), Takahisa Imagawa (2), Taisei Hashimoto (2,3), Takashi Onishi (1,2), Yoshimasa Tsuruoka (2,3). 1 NEC Corporation; 2 National Institute of Advanced Industrial Science and Technology; 3 The University of Tokyo.
Pseudocode Yes "Algorithm 1 REDQ" and "Algorithm 2 DroQ" (a sketch of the DroQ dropout Q-function appears after this table).
Open Source Code Yes Our source code is available at https://github.com/TakuyaHiraoka/Dropout-Q-Functions-for-Doubly-Efficient-Reinforcement-Learning
Open Datasets Yes To evaluate the performances of DroQ, we compared DroQ with three baseline methods in MuJoCo benchmark environments (Todorov et al., 2012; Brockman et al., 2016). Following Chen et al. (2021b); Janner et al. (2019), we prepared the following environments: Hopper, Walker2d, Ant, and Humanoid.
Dataset Splits No The paper describes running 'ten test episodes with the current policy' and recording the average return after every epoch, which serves as evaluation during training. However, it does not specify explicit train/validation/test dataset splits in the traditional supervised-learning sense, as data is generated through environment interaction (this evaluation protocol is sketched after this table).
Hardware Specification Yes For evaluation, we ran each method on a machine equipped with two Intel(R) Xeon(R) CPU E5-2667 v4 and one NVIDIA Tesla K80.
Software Dependencies No The paper mentions using 'Adam' as an optimizer and references the PyTorch profiler, but it does not specify version numbers for key software components like Python, PyTorch, or CUDA that would be needed for replication.
Experiment Setup Yes "The hyperparameter settings for each method in the experiments discussed in Section 4 are listed in Table 8. Parameter values, except for (i) dropout rate for DroQ and DUVN and (ii) M for DUVN, were set according to Chen et al. (2021b). The dropout rate (i) was set through line search, and M for DUVN (ii) was set according to Harrigan (2016); Moerland et al. (2017)." Table 8 (hyperparameter settings; a configuration sketch follows this table):
- SAC, REDQ, DroQ, and DUVN: optimizer Adam (Kingma & Ba, 2015); learning rate 3e-4; discount rate (γ) 0.99; target-smoothing coefficient (ρ) 0.005; replay buffer size 10^6; number of hidden layers for all networks 2; number of hidden units per layer 256; mini-batch size 256; random starting data 5000; UTD ratio G 20
- REDQ and DroQ: in-target minimization parameter M 2
- REDQ: ensemble size N 10
- DroQ and DUVN: dropout rate 0.01
- DUVN: in-target minimization parameter M 1
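
As noted in the Pseudocode entry above, the paper's Algorithm 2 (DroQ) replaces REDQ's large Q-ensemble with a small number of Q-functions that use dropout and layer normalization. The PyTorch sketch below shows one such dropout Q-function under the Table 8 settings (two hidden layers, 256 units, dropout rate 0.01); the class and argument names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class DropoutQFunction(nn.Module):
    """Q-network with dropout and layer normalization, in the style of DroQ.

    Illustrative sketch only; not the authors' exact code.
    """
    def __init__(self, obs_dim, act_dim, hidden=256, dropout_rate=0.01):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.Dropout(p=dropout_rate),   # dropout after each hidden linear layer
            nn.LayerNorm(hidden),         # layer normalization stabilizes training
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Dropout(p=dropout_rate),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),         # scalar Q-value for the (state, action) pair
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))
```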
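
The Dataset Splits entry quotes the evaluation protocol: after every epoch, ten test episodes are run with the current policy and the average return is recorded. A minimal sketch of that loop, assuming the classic Gym step/reset API and MuJoCo environment ids matching the Open Datasets entry (the exact version suffixes are an assumption):

```python
import gym
import numpy as np

# Benchmark environments used in the paper; Gym ids and version suffixes assumed here.
ENV_IDS = ["Hopper-v2", "Walker2d-v2", "Ant-v2", "Humanoid-v2"]

def evaluate(env, policy, num_episodes=10):
    """Run test episodes with the current policy and return the average return,
    mirroring the per-epoch evaluation protocol described in the paper."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                      # e.g. the mean action for evaluation
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Example usage (policy is any callable mapping observation -> action):
# env = gym.make(ENV_IDS[0])
# avg_return = evaluate(env, policy)
```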
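
For convenience, the Table 8 settings from the Experiment Setup entry can be collected in a single configuration mapping. This is only a sketch of the reported values (key names are made up here); the learning rate is written as 3e-4, the value used in Chen et al. (2021b), whose settings the paper follows.

```python
# Hyperparameters from Table 8 (shared SAC/REDQ/DroQ/DUVN settings plus
# method-specific entries); key names are illustrative.
config = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "target_smoothing_rho": 0.005,
    "replay_buffer_size": 10**6,
    "num_hidden_layers": 2,
    "hidden_units_per_layer": 256,
    "mini_batch_size": 256,
    "random_starting_data": 5000,
    "utd_ratio_G": 20,
    "redq_droq_in_target_min_M": 2,
    "redq_ensemble_size_N": 10,
    "droq_duvn_dropout_rate": 0.01,
    "duvn_in_target_min_M": 1,
}
```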