VA-learning as a more efficient alternative to Q-learning

Authors: Yunhao Tang, Remi Munos, Mark Rowland, Michal Valko

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We start with experiments on tabular MDPs, to understand the improved sample efficiency of VA-learning over Q-learning. Then we evaluate the impacts of VA-learning and behavior dueling in deep RL settings.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@deepmind.com>.
Pseudocode | Yes | Algorithm 1 Tabular VA-learning (a hedged sketch of the value/advantage idea appears after this table)
Open Source Code | No | The paper mentions that their agents are based on a reference implementation from 'DQN Zoo: Reference implementations of DQN-based agents. URL http://github.com/deepmind/dqn_zoo.' However, it does not provide a specific link or statement about the open-sourcing of their own VA-learning or behavior dueling implementations.
Open Datasets | Yes | All the deep RL experiments use the DQN agent (Mnih et al., 2013) as the baseline agent and use the Atari 57 game suite as the test bed (Bellemare et al., 2013).
Dataset Splits | No | The paper evaluates performance across 200M training frames and uses multiple random seeds, but it does not provide specific train/validation/test dataset split percentages or counts.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using the RMSProp optimizer and refers to DQN and other algorithms, but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the standard DQN architecture specified in (Mnih et al., 2013)... Throughout, we use n = 5... All agents use the RMSProp optimizer... By default, one-step DQN agent uses the learning rate β = 2.5 × 10^-4. When using n-step Q-learning with n = 5, we find the learning rate is best set smaller to be at 5 × 10^-5. When doing VA-learning, behavior dueling and uniform dueling, we find it improves performance further by reducing the learning rate more, to 1.5 × 10^-5. and "optimizing the Huber loss is a more robust alternative (Quan and Ostrovski): huber(x) = x^2 · I[|x| ≤ τ] + |x| · I[|x| > τ], where by default τ = 1."
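As a quick illustration of the quoted Experiment Setup details, the snippet below implements the Huber loss exactly as written in that cell and collects the reported learning rates as constants. The function name `huber` and the dictionary layout are illustrative choices, not taken from the paper or from DQN Zoo.

```python
import numpy as np

# Learning rates as reported in the Experiment Setup row above.
LEARNING_RATES = {
    "one_step_dqn": 2.5e-4,             # default one-step DQN
    "n_step_q_learning_n5": 5e-5,       # n-step Q-learning with n = 5
    "va_learning_and_dueling": 1.5e-5,  # VA-learning, behavior/uniform dueling
}

def huber(x, tau=1.0):
    """Huber loss as quoted above: x^2 inside the threshold, |x| outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= tau, x ** 2, np.abs(x))

# Example: TD errors of varying magnitude.
print(huber(np.array([-2.0, -0.5, 0.0, 0.5, 3.0])))
# expected values: [2.0, 0.25, 0.0, 0.25, 3.0]
```

Note that the quoted form is continuous at the threshold only when τ = 1 (since x^2 = |x| there), which matches the stated default.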
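The Pseudocode row above only names Algorithm 1 (tabular VA-learning) without reproducing it. Purely as a hedged illustration of the value/advantage decomposition the title suggests, the sketch below shows one plausible tabular update in which the action value is represented as Q(s, a) = V(s) + A(s, a) and both components are regressed toward a shared bootstrapped target. The update rule, function name, and array layout are assumptions for illustration only; consult the paper's Algorithm 1 for the authors' exact procedure.

```python
import numpy as np

def va_update(V, A, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Hypothetical one-transition update for a V + A decomposition.

    V is a (num_states,) array, A is a (num_states, num_actions) array, and
    the composed action value is Q(s, a) = V[s] + A[s, a]. This is an
    illustrative sketch, not the paper's Algorithm 1.
    """
    # Bootstrapped target built from the composed action values at s_next.
    q_next = V[s_next] + A[s_next]  # shape (num_actions,)
    target = r + (0.0 if done else gamma * np.max(q_next))
    # The advantage keeps the part of the target specific to the taken action,
    # while V absorbs the action-independent part shared across actions.
    A[s, a] += alpha * (target - (V[s] + A[s, a]))
    V[s] += alpha * (target - V[s])
    return V, A
```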