VA-learning as a more efficient alternative to Q-learning
Authors: Yunhao Tang, Remi Munos, Mark Rowland, Michal Valko
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We start with experiments on tabular MDPs, to understand the improved sample efficiency of VA-learning over Q-learning. Then we evaluate the impacts of VA-learning and behavior dueling in deep RL settings. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Yunhao Tang <robintyh@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1: Tabular VA-learning. (An illustrative code sketch of this tabular update appears below the table.) |
| Open Source Code | No | The paper mentions that their agents are based on a reference implementation from 'DQN Zoo: Reference implementations of DQN-based agents. URL http://github.com/deepmind/dqn_zoo.' However, it does not provide a specific link or statement about the open-sourcing of their own VA-learning or behavior dueling implementations. |
| Open Datasets | Yes | All the deep RL experiments use the DQN agent (Mnih et al., 2013) as the baseline agent and use the Atari 57 game suite as the test bed (Bellemare et al., 2013). |
| Dataset Splits | No | The paper evaluates performance across 200M training frames and uses multiple random seeds, but it does not provide specific train/validation/test dataset split percentages or counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the RMSProp optimizer and refers to DQN and other algorithms, but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the standard DQN architecture specified in (Mnih et al., 2013)... Throughout, we use n = 5... All agents use the RMSProp optimizer... By default, one-step DQN agent uses the learning rate β = 2.5 × 10⁻⁴. When using n-step Q-learning with n = 5, we find the learning rate is best set smaller to be at 5 × 10⁻⁵. When doing VA-learning, behavior dueling and uniform dueling, we find it improves performance further by reducing the learning rate more, to 1.5 × 10⁻⁵. and "optimizing the Huber loss is a more robust alternative (Quan and Ostrovski): huber(x) = x² · I[\|x\| ≤ τ] + \|x\| · I[\|x\| > τ], where by default τ = 1." |
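
For orientation, here is a minimal sketch of what a tabular update under the decomposition Q(s, a) = V(s) + A(s, a) (the decomposition VA-learning builds on) might look like. The function name, hyperparameter values, and the exact update rules, in particular which copy of V(s) the advantage target uses, are illustrative assumptions rather than a verbatim transcription of Algorithm 1 from the paper.

```python
import numpy as np

def va_learning_update(V, A, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular update on a transition (s, a, r, s_next), assuming Q = V + A.

    V is a (num_states,) array and A is a (num_states, num_actions) array.
    The rule below (both tables moved toward the greedy bootstrap target,
    with A absorbing the action-dependent residual) is an illustrative
    assumption; see Algorithm 1 in the paper for the exact form.
    """
    # Greedy bootstrap target on the implicit Q(s', .) = V(s') + A(s', .).
    target = r + gamma * np.max(V[s_next] + A[s_next])
    v_old = V[s]
    # V is updated on every transition leaving state s, whatever action was
    # taken, which is the data-pooling effect behind the claimed
    # sample-efficiency gain over plain Q-learning.
    V[s] += alpha * (target - V[s])
    # A absorbs the residual relative to the pre-update V(s) (assumed ordering).
    A[s, a] += alpha * (target - v_old - A[s, a])
```

Acting would then use the implicit Q-values, e.g. an epsilon-greedy policy over V[s] + A[s].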
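
The Huber loss quoted in the Experiment Setup row translates directly into code. Note that the quoted form omits the 1/2 and τ-offset factors of the textbook Huber loss; the sketch below follows the expression exactly as quoted.

```python
import numpy as np

def huber(x, tau=1.0):
    """huber(x) = x^2 * I[|x| <= tau] + |x| * I[|x| > tau], with tau = 1
    by default, as quoted from the paper."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= tau, x ** 2, np.abs(x))
```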