Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

Authors: Xinyue Chen, Che Wang, Zijian Zhou, Keith W. Ross

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based method, and with less wall-clock run time. REDQ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio ≥ 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. Through carefully designed experiments, we provide a detailed analysis of REDQ and related model-free algorithms. (A code sketch of this update rule appears after the table.)
Researcher Affiliation | Academia | Xinyue Chen (1), Che Wang (1,2), Zijian Zhou (1), Keith Ross (1,2); (1) New York University Shanghai, (2) New York University
Pseudocode | Yes | The pseudocode for REDQ is shown in Algorithm 1. ... Algorithm 1 Randomized Ensembled Double Q-learning (REDQ) ... The complete pseudocode for tabular REDQ is provided in the Appendix. ... Algorithm 2 Tabular REDQ. (A tabular sketch follows the table.)
Open Source Code | Yes | To ensure our comparisons are fair, and to ensure our results are reproducible (Henderson et al., 2018; Islam et al., 2017; Duan et al., 2016), we provide open source code(1). ... (1) Code and implementation tutorial can be found at: https://github.com/watchernyu/REDQ
Open Datasets | Yes | Compared to Soft Actor-Critic (SAC), which is model-free and uses a UTD of 1, MBPO achieves much higher sample efficiency in the OpenAI MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016).
Dataset Splits | No | The paper describes how data is generated and used in a reinforcement learning setting (e.g., "replay buffer size 10^6", "random starting data 5000"). However, it does not specify explicit train/validation/test *splits* of a pre-existing static dataset, as is common in supervised learning. The evaluation protocol mentions running a "test episode" but does not define a validation split from collected data. (See the replay-buffer sketch after the table.)
Hardware Specification | Yes | Additionally, we measured the runtime on a 2080-Ti GPU and found that MBPO roughly takes 75% longer.
Software Dependencies | No | The paper mentions using 'optimizer Adam' and notes 'SAC' as the underlying off-policy algorithm, but it does not specify version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages (e.g., Python version) beyond citing general methods.
Experiment Setup | Yes | Table 1: REDQ hyperparameters. ... optimizer: Adam; learning rate: 3e-4; discount (γ): 0.99; target smoothing coefficient (ρ): 0.005; replay buffer size: 10^6; number of hidden layers for all networks: 2; number of hidden units per layer: 256; mini-batch size: 256; nonlinearity: ReLU; random starting data: 5000; ensemble size N: 10; in-target minimization parameter M: 2; UTD ratio G: 20. (Collected into a config sketch below.)
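
The three REDQ ingredients quoted in the "Research Type" row translate into a short update loop. Below is a minimal PyTorch sketch, assuming an SAC-style agent; the interfaces (buffer.sample, policy.sample, the q_nets / q_targets lists) are illustrative assumptions, not the authors' released implementation.

```python
import random
import torch
import torch.nn.functional as F

def redq_critic_update(q_nets, q_targets, q_optims, policy, buffer,
                       G=20, M=2, batch_size=256, gamma=0.99, alpha=0.2):
    """One environment step's worth of REDQ critic updates (UTD ratio G)."""
    for _ in range(G):
        # Assumes buffer.sample returns float tensors, with done in {0, 1}.
        s, a, r, s2, done = buffer.sample(batch_size)
        with torch.no_grad():
            a2, logp2 = policy.sample(s2)
            # In-target minimization: min over a random subset of M of the
            # N target networks, rather than over the whole ensemble.
            idx = random.sample(range(len(q_targets)), M)
            q_min = torch.min(torch.stack([q_targets[i](s2, a2) for i in idx]),
                              dim=0).values
            y = r + gamma * (1.0 - done) * (q_min - alpha * logp2)
        # Every critic in the ensemble regresses toward the same target y.
        for q, opt in zip(q_nets, q_optims):
            loss = F.mse_loss(q(s, a), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Note how the in-target minimum uses only M = 2 randomly chosen targets while all N = 10 critics train on the same target value; the Polyak update of the target networks (coefficient ρ = 0.005) is omitted for brevity.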
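The appendix's Algorithm 2 (tabular REDQ) is referenced in the "Pseudocode" row but not quoted; the following sketch is an assumption that applies the same three ingredients to plain tabular Q-learning, not a transcription of the appendix.

```python
import random

def tabular_redq_update(Q, s, a, r, s2, done, actions, M=2, lr=0.1, gamma=0.99):
    """Q is a list of N tabular Q-functions, e.g. defaultdict(float) tables
    keyed by (state, action). Performs one REDQ-style tabular update."""
    # Target: min over a random subset of M tables, max over next actions.
    idx = random.sample(range(len(Q)), M)
    bootstrap = 0.0 if done else gamma * max(
        min(Q[i][(s2, a2)] for i in idx) for a2 in actions)
    target = r + bootstrap
    # Every table in the ensemble moves toward the shared target.
    for i in range(len(Q)):
        Q[i][(s, a)] += lr * (target - Q[i][(s, a)])
```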
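On the "Dataset Splits" row: off-policy RL algorithms such as REDQ draw mini-batches from a replay buffer filled online during interaction, which is why no static train/validation/test split exists. A minimal uniform-sampling buffer, for illustration only:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=int(1e6)):
        # Oldest transitions are evicted once capacity is reached.
        self.data = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.data.append((s, a, r, s2, done))

    def sample(self, batch_size=256):
        # Uniform sampling without replacement; requires
        # len(self.data) >= batch_size.
        return random.sample(self.data, batch_size)
```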
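Finally, the Table 1 values from the "Experiment Setup" row can be gathered in one place; the dict below is a convenience sketch, with key names of our choosing rather than the authors' file format.

```python
# REDQ hyperparameters from Table 1 of the paper.
redq_hparams = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "target_smoothing_rho": 0.005,
    "replay_buffer_size": int(1e6),
    "num_hidden_layers": 2,
    "hidden_units_per_layer": 256,
    "mini_batch_size": 256,
    "nonlinearity": "ReLU",
    "random_starting_data": 5000,
    "ensemble_size_N": 10,
    "in_target_min_param_M": 2,
    "utd_ratio_G": 20,
}
```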