UCB Momentum Q-learning: Correcting the bias without forgetting

Authors: Pierre Ménard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present a numerical simulation to illustrate the benefits of not forgetting the targets in UCBMQ. We compare UCBMQ to the following baselines: (i) UCBVI (Azar et al., 2017); (ii) OptQL (Jin et al., 2018); and (iii) Greedy-UCBVI, a version of UCBVI using real-time dynamic programming (Efroni et al., 2019). We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (since the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets.
Researcher Affiliation | Collaboration | 1 Otto von Guericke University, 2 Inria, 3 Université de Lille, 4 DeepMind Paris.
Pseudocode | Yes | Algorithm 1: UCBMQ
Open Source Code | Yes | The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a).
Open Datasets | Yes | We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a). (A minimal sketch of this environment is given after the table.)
Dataset Splits | No | No specific training/test/validation dataset splits are mentioned. The paper describes a reinforcement learning setting in a simulated grid-world environment, where learning occurs through interaction over episodes rather than fixed dataset splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided.
Software Dependencies | No | The paper mentions the rlberry library (Domingues et al., 2021a) but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (since the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets. We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. We use the same exploration bonus for all the algorithms, given by β^t_h(s, a) = min( √(1/n^t_h(s, a)) + (H − h + 1)/n^t_h(s, a), H − h + 1 ). In Figure 1, UCBMQ outperforms OptQL for H = 100 and transition noise ε = 0.15. (A sketch of this bonus is given after the table.)
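
The grid-world quoted above can be illustrated with a minimal sketch. This is not the paper's rlberry-based code: the class name NoisyGridWorld, the boundary clipping, the mapping of actions to axes, and the way noise is applied (replacing the chosen action by a random one with probability ε) are assumptions that go beyond the quoted text.

```python
import numpy as np

# Minimal sketch of the 10 x 5 grid-world quoted above: 50 states (i, j) with
# i in 1..10 and j in 1..5, 4 actions, transition noise epsilon, start at (1, 1),
# reward 1 at (10, 5) and 0 elsewhere. Names and boundary handling are illustrative.
class NoisyGridWorld:
    # Assumed action effects on (i, j): left, right, up, down.
    ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]

    def __init__(self, n_rows=10, n_cols=5, noise=0.15, seed=0):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.noise = noise
        self.rng = np.random.default_rng(seed)
        self.state = (1, 1)

    def reset(self):
        self.state = (1, 1)
        return self.state

    def _move(self, state, action):
        di, dj = self.ACTIONS[action]
        i = min(max(state[0] + di, 1), self.n_rows)  # clip to stay on the grid
        j = min(max(state[1] + dj, 1), self.n_cols)
        return (i, j)

    def step(self, action):
        # With probability 1 - epsilon move in the chosen direction; with
        # probability epsilon move to a random neighboring state instead.
        if self.rng.random() < self.noise:
            action = self.rng.integers(len(self.ACTIONS))
        self.state = self._move(self.state, action)
        reward = 1.0 if self.state == (self.n_rows, self.n_cols) else 0.0
        return self.state, reward
```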
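
The exploration bonus quoted in the Experiment Setup row can be written as a small helper. This is a sketch under assumptions: the square-root term is a reconstruction of the extracted formula and should be checked against the paper's PDF, the function name is hypothetical, and the handling of unvisited state-action pairs (n = 0) is not specified in the quoted text.

```python
import math

def exploration_bonus(n, h, H):
    """Bonus from the quoted formula:
        beta^t_h(s, a) = min( sqrt(1 / n^t_h(s, a)) + (H - h + 1) / n^t_h(s, a), H - h + 1 )

    n is the visit count n^t_h(s, a) of the state-action pair at step h.
    The n == 0 branch (maximal optimistic bonus) is an assumption.
    """
    if n == 0:
        return H - h + 1
    return min(math.sqrt(1.0 / n) + (H - h + 1) / n, H - h + 1)
```

For example, with H = 100 an unvisited pair at step h = 1 receives the maximal bonus of 100, and the bonus shrinks towards √(1/n) as the visit count n grows.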