UCB Momentum Q-learning: Correcting the bias without forgetting

Authors: Pierre Ménard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present a numerical simulation to illustrate the benefits of not forgetting the targets in UCBMQ. We compare UCBMQ to the following baselines: (i) UCBVI (Azar et al., 2017); (ii) OptQL (Jin et al., 2018); and (iii) Greedy-UCBVI, a version of UCBVI using real-time dynamic programming (Efroni et al., 2019). We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (since the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets.
Researcher Affiliation | Collaboration | 1 Otto von Guericke University, 2 Inria, 3 Université de Lille, 4 DeepMind Paris.
Pseudocode | Yes | Algorithm 1: UCBMQ
Open Source Code | Yes | The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a).
Open Datasets | Yes | We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a). (A minimal sketch of this environment is given after the table.)
Dataset Splits | No | No specific training/test/validation dataset splits are mentioned. The paper describes a reinforcement learning setting in a simulated grid-world environment, where learning occurs through interaction over episodes rather than fixed dataset splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided.
Software Dependencies | No | The paper mentions the rlberry library (Domingues et al., 2021a) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (since the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets. We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. We use the same exploration bonus for all the algorithms, given by β^t_h(s, a) = min( √(1/n^t_h(s, a)) + (H − h + 1)/n^t_h(s, a), H − h + 1 ). In Figure 1, UCBMQ outperforms OptQL for H = 100 and transition noise ε = 0.15. (A sketch of this bonus is given after the table.)
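
The grid-world quoted above can be illustrated with a minimal sketch. This is not the paper's rlberry-based code: the class name NoisyGridWorld, the boundary clipping, the mapping of actions to axes, and the way noise is applied (replacing the chosen action by a random one with probability ε) are assumptions that go beyond the quoted text.

```python
import numpy as np

# Minimal sketch of the 10 x 5 grid-world quoted above: 50 states (i, j) with
# i in 1..10 and j in 1..5, 4 actions, transition noise epsilon, start at (1, 1),
# reward 1 at (10, 5) and 0 elsewhere. Names and boundary handling are illustrative.
class NoisyGridWorld:
    # Assumed action effects on (i, j): left, right, up, down.
    ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]

    def __init__(self, n_rows=10, n_cols=5, noise=0.15, seed=0):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.noise = noise
        self.rng = np.random.default_rng(seed)
        self.state = (1, 1)

    def reset(self):
        self.state = (1, 1)
        return self.state

    def _move(self, state, action):
        di, dj = self.ACTIONS[action]
        i = min(max(state[0] + di, 1), self.n_rows)  # clip to stay on the grid
        j = min(max(state[1] + dj, 1), self.n_cols)
        return (i, j)

    def step(self, action):
        # With probability 1 - epsilon move in the chosen direction; with
        # probability epsilon move to a random neighboring state instead.
        if self.rng.random() < self.noise:
            action = self.rng.integers(len(self.ACTIONS))
        self.state = self._move(self.state, action)
        reward = 1.0 if self.state == (self.n_rows, self.n_cols) else 0.0
        return self.state, reward
```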
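
The exploration bonus quoted in the Experiment Setup row can be written as a small helper. This is a sketch under assumptions: the square-root term is a reconstruction of the extracted formula and should be checked against the paper's PDF, the function name is hypothetical, and the handling of unvisited state-action pairs (n = 0) is not specified in the quoted text.

```python
import math

def exploration_bonus(n, h, H):
    """Bonus from the quoted formula:
        beta^t_h(s, a) = min( sqrt(1 / n^t_h(s, a)) + (H - h + 1) / n^t_h(s, a), H - h + 1 )

    n is the visit count n^t_h(s, a) of the state-action pair at step h.
    The n == 0 branch (maximal optimistic bonus) is an assumption.
    """
    if n == 0:
        return H - h + 1
    return min(math.sqrt(1.0 / n) + (H - h + 1) / n, H - h + 1)
```

For example, with H = 100 an unvisited pair at step h = 1 receives the maximal bonus of 100, and the bonus shrinks towards √(1/n) as the visit count n grows.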