Revisiting Peng's Q(λ) for Modern Reinforcement Learning

Authors: Tadashi Kozuno, Yunhao Tang, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q(λ) in complex continuous control tasks, confirming that Peng's Q(λ) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.
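For reference, Peng's Q(λ) builds its update target as a λ-weighted mixture of n-step returns, each of which bootstraps with a greedy value and does not cut the trace at off-policy actions; this is what makes it "non-conservative". The sketch below is a minimal illustration of that target computation for a single discrete-action trajectory. The function name, array layout, and the use of a discrete max (the paper's deep RL experiments are in continuous control, where the bootstrap would come from a target policy instead) are assumptions for illustration, not code from the paper.

```python
import numpy as np

def pengs_q_lambda_targets(rewards, bootstrap_q, gamma=0.99, lam=0.9):
    """Illustrative Peng's Q(lambda) targets for one trajectory.

    rewards[t]      -- reward r_t received at step t (t = 0..T-1)
    bootstrap_q[t]  -- max_a Q(s_{t+1}, a), the greedy bootstrap value at the
                       next state (set to zero after a terminal state)

    Uses the backward recursion
        G_t = r_t + gamma * ((1 - lam) * bootstrap_q[t] + lam * G_{t+1}),
    truncating the trajectory by taking G_T = bootstrap_q[T-1].
    """
    T = len(rewards)
    targets = np.empty(T)
    g = bootstrap_q[-1]  # G_T: plain greedy bootstrap at the truncation point
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * bootstrap_q[t] + lam * g)
        targets[t] = g
    return targets
```

With λ = 0 this reduces to the one-step Q-learning target; with λ = 1 it becomes an uncorrected multi-step return bootstrapped only at the truncation point.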
Researcher Affiliation | Collaboration | 1 Independent Researcher, Okayama, Japan (now at the University of Alberta); 2 Columbia University, NY, USA; 3 DeepMind, London, UK; 4 DeepMind, Paris, France.
Pseudocode | No | The paper describes algorithms using mathematical notation and text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper refers to using environments such as the "DeepMind (DM) control suite (Tassa et al., 2020)" and "an open-sourced simulator Bullet physics (Coumans & Bai, 2016-2019)" for generating data, but it does not use or provide access information for a pre-collected, publicly available dataset in the traditional sense.
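For context, both environment suites named in the response are openly available simulators rather than fixed datasets: training data is generated online by interacting with them. Below is a minimal sketch of how such environments are typically instantiated; the specific domain, task, and environment names are illustrative assumptions, not necessarily the ones used in the paper.

```python
# Illustrative only: the chosen tasks are examples, not the paper's exact benchmark list.
from dm_control import suite   # DeepMind Control Suite (Tassa et al., 2020)
import gym
import pybullet_envs           # registers Bullet-physics environments with Gym

dm_env = suite.load(domain_name="cheetah", task_name="run")
bullet_env = gym.make("HalfCheetahBulletEnv-v0")

timestep = dm_env.reset()      # dm_control returns a TimeStep namedtuple
obs = bullet_env.reset()       # classic Gym returns the initial observation
```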
Dataset Splits | No | The paper describes training and evaluation in reinforcement learning environments but does not provide specific percentages or counts for training/validation/test dataset splits, as is common in supervised learning.
Hardware Specification | No | The acknowledgements mention a "cluster" maintained by OIST's Scientific Computation and Data Analysis section and "computational support from Google Cloud Platform," but no specific hardware models (e.g., GPU/CPU models, memory details) are provided for the experiments.
Software Dependencies | No | The paper mentions using "TD3 (Fujimoto et al., 2018)" as a base algorithm and refers to "PyBullet" but does not specify version numbers for any software dependencies, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | For the deep RL experiments, the paper states in Section 6.1, "See further details in Appendix J.", and in Section 7.2, "All algorithms are trained with a fixed number of steps and results are averaged across 5 random seeds." Appendix J lists specific hyperparameters, noting that "hyperparameters are kept identical to those in Fujimoto et al. (2018) for the DDPG, TD3, and SAC agents", and gives optimizer details such as "We used Adam optimizer... learning rate 3e-4" as well as the network architecture.
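Putting the quoted details together, a minimal sketch of the reported setup might look like the configuration below. Only the base agents, optimizer, learning rate, and seed count are taken from the quoted text; every other field is a hypothetical placeholder standing in for the Fujimoto et al. (2018) defaults that the paper says it reuses.

```python
# Hypothetical experiment configuration reconstructed from the quoted setup.
# Fields marked "stated" are quoted in the paper; everything else is an assumption.
experiment_config = {
    "base_agents": ["DDPG", "TD3", "SAC"],  # stated: hyperparameters follow Fujimoto et al. (2018)
    "optimizer": "Adam",                    # stated (Appendix J)
    "learning_rate": 3e-4,                  # stated (Appendix J)
    "num_seeds": 5,                         # stated: results averaged across 5 random seeds
    "total_env_steps": None,                # "a fixed number of steps"; exact value not quoted here
    "discount_gamma": 0.99,                 # assumed common default, not quoted
    "lambda": None,                         # the value(s) of lambda used are not quoted here
}
```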