Randomized Ensembled Double Q-Learning: Learning Fast Without a Model
Authors: Xinyue Chen, Che Wang, Zijian Zhou, Keith W. Ross
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based method, and with less wall-clock run time. REDQ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio ≫ 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. Through carefully designed experiments, we provide a detailed analysis of REDQ and related model-free algorithms. |
| Researcher Affiliation | Academia | Xinyue Chen (1), Che Wang (1,2), Zijian Zhou (1), Keith Ross (1,2); (1) New York University Shanghai, (2) New York University |
| Pseudocode | Yes | The pseudocode for REDQ is shown in Algorithm 1. ... Algorithm 1 Randomized Ensembled Double Q-learning (REDQ) ... The complete pseudocode for tabular REDQ is provided in the Appendix. ... Algorithm 2 Tabular REDQ |
| Open Source Code | Yes | To ensure our comparisons are fair, and to ensure our results are reproducible (Henderson et al., 2018; Islam et al., 2017; Duan et al., 2016), we provide open source code. ... Code and implementation tutorial can be found at: https://github.com/watchernyu/REDQ |
| Open Datasets | Yes | Compared to Soft Actor-Critic (SAC), which is model-free and uses a UTD of 1, MBPO achieves much higher sample efficiency in the OpenAI MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016). |
| Dataset Splits | No | The paper describes how data is generated and used in a reinforcement learning setting (e.g., "replay buffer size 10^6", "random starting data 5000"). However, it does not specify explicit train/validation/test *splits* of a pre-existing static dataset, which is common in supervised learning. The evaluation protocol mentions running a "test episode" but does not define a validation split from collected data. |
| Hardware Specification | Yes | Additionally, we measured the runtime on a 2080-Ti GPU and found that MBPO roughly takes 75% longer. |
| Software Dependencies | No | The paper mentions using 'optimizer Adam' and notes 'SAC' as the underlying off-policy algorithm, but it does not specify version numbers for any software libraries, frameworks (like PyTorch, TensorFlow), or programming languages used (e.g., Python version) beyond citing general methods. |
| Experiment Setup | Yes | Table 1: REDQ hyperparameters. ... optimizer Adam, learning rate 3e-4, discount (γ) 0.99, target smoothing coefficient (ρ) 0.005, replay buffer size 10^6, number of hidden layers for all networks 2, number of hidden units per layer 256, mini-batch size 256, nonlinearity ReLU, random starting data 5000, ensemble size N 10, in-target minimization parameter M 2, UTD ratio G 20. |
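
The three REDQ ingredients quoted in the Research Type row (a UTD ratio G ≫ 1, an ensemble of N Q-functions, and in-target minimization over a random subset of size M) can be summarized in a short critic-update sketch. The PyTorch snippet below is a minimal illustration, not the authors' released implementation; `policy`, `alpha`, and the batch layout are assumed placeholders for a SAC-style actor and entropy temperature.

```python
import random
import torch
import torch.nn.functional as F

def redq_critic_update(batch, q_nets, q_targets, q_optims, policy,
                       alpha, gamma=0.99, M=2):
    """One REDQ critic update: every Q-function in the ensemble regresses to a
    shared target built from a random subset of M target Q-functions."""
    obs, act, rew, next_obs, done = batch

    with torch.no_grad():
        next_act, next_logp = policy.sample(next_obs)
        # In-target minimization: sample M of the N target Q-functions
        # uniformly without replacement and take their elementwise minimum.
        subset = random.sample(range(len(q_targets)), M)
        q_next = torch.min(
            torch.stack([q_targets[i](next_obs, next_act) for i in subset]),
            dim=0).values
        # SAC-style entropy-regularized Bellman target, shared by all members.
        target = rew + gamma * (1.0 - done) * (q_next - alpha * next_logp)

    # All N Q-functions are updated toward the same target.
    for q_net, optim in zip(q_nets, q_optims):
        loss = F.mse_loss(q_net(obs, act), target)
        optim.zero_grad()
        loss.backward()
        optim.step()
```

With a UTD ratio of G, this critic update is applied G times per environment interaction (each on a fresh mini-batch and each followed by a target-network update in the paper's Algorithm 1), before a single policy update that, per Algorithm 1, uses the average over all N Q-functions.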
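
The Table 1 settings quoted in the Experiment Setup row can be instantiated roughly as follows. This is a hedged sketch rather than the repository's API; `obs_dim` and `act_dim` are illustrative values, and only the network architecture, ensemble size, optimizer, and smoothing coefficient are taken from the table.

```python
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network with 2 hidden layers of 256 ReLU units, as listed in Table 1."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

obs_dim, act_dim = 17, 6   # illustrative MuJoCo (Walker2d) sizes, not from Table 1
N, M, G = 10, 2, 20        # ensemble size, in-target subset size, UTD ratio (Table 1)

q_nets = [QNet(obs_dim, act_dim) for _ in range(N)]
q_targets = [copy.deepcopy(q) for q in q_nets]          # target networks
q_optims = [torch.optim.Adam(q.parameters(), lr=3e-4)   # Adam at 3e-4 (Table 1)
            for q in q_nets]

def soft_update(targets, sources, rho=0.005):
    """Polyak averaging: blend a fraction rho of the online weights into each
    target network per update (standard SAC-style convention)."""
    for tgt, src in zip(targets, sources):
        for p_t, p_s in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - rho).add_(rho * p_s.data)
```

Per the Table 1 settings, training would then interleave G = 20 critic updates (with `soft_update` after each) per environment step, drawing mini-batches of 256 from a replay buffer of size 10^6 after collecting 5000 random starting transitions.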