VA-learning as a more efficient alternative to Q-learning

Authors: Yunhao Tang, Remi Munos, Mark Rowland, Michal Valko

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We start with experiments on tabular MDPs, to understand the improved sample efficiency of VA-learning over Q-learning. Then we evaluate the impacts of VA-learning and behavior dueling in deep RL settings.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@deepmind.com>.
Pseudocode | Yes | Algorithm 1 Tabular VA-learning (a hedged sketch of the value/advantage idea appears after this table)
Open Source Code | No | The paper mentions that their agents are based on a reference implementation from 'DQN Zoo: Reference implementations of DQN-based agents. URL http://github.com/deepmind/dqn_zoo.' However, it does not provide a specific link or statement about the open-sourcing of their own VA-learning or behavior dueling implementations.
Open Datasets | Yes | All the deep RL experiments use the DQN agent (Mnih et al., 2013) as the baseline agent and use the Atari 57 game suite as the test bed (Bellemare et al., 2013).
Dataset Splits | No | The paper evaluates performance across 200M training frames and uses multiple random seeds, but it does not provide specific train/validation/test dataset split percentages or counts.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using the RMSProp optimizer and refers to DQN and other algorithms, but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the standard DQN architecture specified in (Mnih et al., 2013)... Throughout, we use n = 5... All agents use the RMSProp optimizer... By default, one-step DQN agent uses the learning rate β = 2.5 × 10^-4. When using n-step Q-learning with n = 5, we find the learning rate is best set smaller to be at 5 × 10^-5. When doing VA-learning, behavior dueling and uniform dueling, we find it improves performance further by reducing the learning rate more, to 1.5 × 10^-5. and "optimizing the Huber loss is a more robust alternative (Quan and Ostrovski): huber(x) = x^2 · I[|x| ≤ τ] + |x| · I[|x| > τ], where by default τ = 1."
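As a quick illustration of the quoted Experiment Setup details, the snippet below implements the Huber loss exactly as written in that cell and collects the reported learning rates as constants. The function name `huber` and the dictionary layout are illustrative choices, not taken from the paper or from DQN Zoo.

```python
import numpy as np

# Learning rates as reported in the Experiment Setup row above.
LEARNING_RATES = {
    "one_step_dqn": 2.5e-4,             # default one-step DQN
    "n_step_q_learning_n5": 5e-5,       # n-step Q-learning with n = 5
    "va_learning_and_dueling": 1.5e-5,  # VA-learning, behavior/uniform dueling
}

def huber(x, tau=1.0):
    """Huber loss as quoted above: x^2 inside the threshold, |x| outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= tau, x ** 2, np.abs(x))

# Example: TD errors of varying magnitude.
print(huber(np.array([-2.0, -0.5, 0.0, 0.5, 3.0])))
# expected values: [2.0, 0.25, 0.0, 0.25, 3.0]
```

Note that the quoted form is continuous at the threshold only when τ = 1 (since x^2 = |x| there), which matches the stated default.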
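The Pseudocode row above only names Algorithm 1 (tabular VA-learning) without reproducing it. Purely as a hedged illustration of the value/advantage decomposition the title suggests, the sketch below shows one plausible tabular update in which the action value is represented as Q(s, a) = V(s) + A(s, a) and both components are regressed toward a shared bootstrapped target. The update rule, function name, and array layout are assumptions for illustration only; consult the paper's Algorithm 1 for the authors' exact procedure.

```python
import numpy as np

def va_update(V, A, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Hypothetical one-transition update for a V + A decomposition.

    V is a (num_states,) array, A is a (num_states, num_actions) array, and
    the composed action value is Q(s, a) = V[s] + A[s, a]. This is an
    illustrative sketch, not the paper's Algorithm 1.
    """
    # Bootstrapped target built from the composed action values at s_next.
    q_next = V[s_next] + A[s_next]  # shape (num_actions,)
    target = r + (0.0 if done else gamma * np.max(q_next))
    # The advantage keeps the part of the target specific to the taken action,
    # while V absorbs the action-independent part shared across actions.
    A[s, a] += alpha * (target - (V[s] + A[s, a]))
    V[s] += alpha * (target - V[s])
    return V, A
```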