Revisiting Peng's Q(λ) for Modern Reinforcement Learning
Authors: Tadashi Kozuno, Yunhao Tang, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q(λ) in complex continuous control tasks, confirming that Peng's Q(λ) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm. (A minimal sketch of the Peng's Q(λ) return recursion is given after this table.) |
| Researcher Affiliation | Collaboration | 1 Independent Researcher, Okayama, Japan (now at the University of Alberta); 2 Columbia University, NY, USA; 3 DeepMind, London, UK; 4 DeepMind, Paris, France. |
| Pseudocode | No | The paper describes algorithms using mathematical notation and text but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper refers to using environments such as the "DeepMind (DM) control suite (Tassa et al., 2020) and an open sourced simulator Bullet physics (Coumans & Bai, 2016–2019)" for generating data, but it does not use or provide access information for a pre-collected, publicly available dataset in the traditional sense. |
| Dataset Splits | No | The paper describes training and evaluation in reinforcement learning environments but does not provide specific percentages or counts for training/validation/test dataset splits, as is common in supervised learning. |
| Hardware Specification | No | The acknowledgements mention a "cluster" maintained by OIST's Scientific Computation and Data Analysis section and "computational support from Google Cloud Platform," but no specific hardware models (e.g., GPU/CPU models, memory details) are provided for the experiments. |
| Software Dependencies | No | The paper mentions using "TD3 (Fujimoto et al., 2018)" as a base algorithm and refers to "PyBullet", but it does not specify version numbers for any software dependencies, libraries, or frameworks used in their implementation. |
| Experiment Setup | Yes | For the deep RL experiments, Section 6.1 states "See further details in Appendix J." and Section 7.2 states "All algorithms are trained with a fixed number of steps and results are averaged across 5 random seeds." Appendix J gives specific hyperparameters, e.g., "hyperparameters are kept identical to those in Fujimoto et al. (2018) for the DDPG, TD3, and SAC agents", optimizer details such as "We used Adam optimizer... learning rate 3e-4", and the network architecture. (A hedged configuration sketch collecting these reported values follows the table.) |
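
As a reading aid for the algorithm assessed above, here is a minimal sketch of how Peng's Q(λ) targets can be computed over a sampled trajectory. This is not the authors' implementation: the function name `pengs_q_lambda_targets` and the array interface are assumptions, and `next_values[k]` stands for the bootstrap estimate max_a Q(s_{k+1}, a) (or Q(s_{k+1}, π(s_{k+1})) in the actor-critic variants evaluated in the paper). The rewards come from behavior-policy actions and are used without off-policy correction, which is what makes the algorithm non-conservative.

```python
import numpy as np

def pengs_q_lambda_targets(rewards, next_values, gamma, lam):
    """Peng's Q(lambda) targets for one trajectory, via backward recursion.

    rewards[k]     : r_k observed after taking a_k in s_k (behavior-policy actions,
                     used without off-policy correction).
    next_values[k] : bootstrap estimate of max_a Q(s_{k+1}, a)
                     (or Q(s_{k+1}, pi(s_{k+1})) in actor-critic variants).
    Implements the recursion
        G_k = r_k + gamma * ((1 - lam) * next_values[k] + lam * G_{k+1}),
    with G_T bootstrapped by next_values[-1] at the end of the trajectory.
    """
    g = next_values[-1]                      # bootstrap value for the final state
    targets = np.empty(len(rewards))
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * ((1.0 - lam) * next_values[k] + lam * g)
        targets[k] = g
    return targets

# Tiny usage example with made-up numbers.
# lam = 0 recovers one-step Q-learning targets; lam = 1 gives uncorrected multi-step returns.
print(pengs_q_lambda_targets([1.0, 0.0, 1.0], [0.5, 0.4, 0.3], gamma=0.99, lam=0.7))
```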
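
Continuing from the Experiment Setup row, the sketch below collects the quoted setup details into a single configuration object. Only the TD3 base agent with Fujimoto et al. (2018) defaults, the Adam optimizer with learning rate 3e-4, the 5 random seeds, and the DM control suite / PyBullet benchmarks are stated in the paper; the field names and the `None` placeholder are illustrative, not values from the paper.

```python
# Hedged configuration sketch; field names are illustrative, values are the ones quoted above.
experiment_config = {
    "base_agent": "TD3",                    # hyperparameters kept identical to Fujimoto et al. (2018)
    "baseline_agents": ["DDPG", "TD3", "SAC"],
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "num_seeds": 5,                         # results averaged across 5 random seeds
    "total_steps": None,                    # "a fixed number of steps"; exact value not quoted here
    "benchmarks": ["DeepMind Control Suite", "PyBullet"],
}
```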