UCB Momentum Q-learning: Correcting the bias without forgetting
Authors: Pierre Ménard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present a numerical simulation to illustrate the benefits of not forgetting the targets in UCBMQ. We compare UCBMQ to the following baselines: (i) UCBVI (Azar et al., 2017), (ii) OptQL (Jin et al., 2018), and (iii) Greedy-UCBVI, a version of UCBVI using real-time dynamic programming (Efroni et al., 2019). We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (since the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets. |
| Researcher Affiliation | Collaboration | Otto von Guericke University; Inria; Université de Lille; DeepMind Paris. |
| Pseudocode | Yes | Algorithm 1 UCBMQ |
| Open Source Code | Yes | The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a). |
| Open Datasets | Yes | We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. The code to reproduce the experiments is available on GitHub, and uses the rlberry library (Domingues et al., 2021a). A minimal sketch of these dynamics appears after the table. |
| Dataset Splits | No | No specific training/test/validation dataset splits are mentioned. The paper describes a reinforcement learning setting in a simulated grid-world environment, where learning occurs through interaction over episodes rather than fixed dataset splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided. |
| Software Dependencies | No | The paper mentions the 'rlberry library (Domingues et al., 2021a)' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In Figure 1, we observe that UCBMQ outperforms OptQL in our experiments, whereas the only differences in the implementations of the two algorithms are the learning rates and the momentum term used by UCBMQ (the bonuses were kept identical). This illustrates the potential gain in sample efficiency enabled by not forgetting the targets. We use a grid-world environment with 50 states (i, j) ∈ [10] × [5] and 4 actions (left, right, up and down). When taking an action, the agent moves in the corresponding direction with probability 1 − ε, and moves to a neighbor state at random with probability ε. The starting position is (1, 1). The reward equals 1 at the state (10, 5) and is zero elsewhere. We use the same exploration bonus for all the algorithms, given by β_h^t(s, a) = min( √(1/n_h^t(s, a)) + (H − h + 1)/n_h^t(s, a), H − h + 1 ). The results in Figure 1 are for H = 100 and transition noise ε = 0.15. A hedged sketch of this bonus computation appears after the table. |
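
The grid-world referenced in the Open Datasets and Experiment Setup rows is small enough to sketch directly. Below is a minimal, self-contained reconstruction of those dynamics, not the authors' rlberry-based code: the function names, the seeding, and the choice to resample uniformly over the four directions when the noise event fires are assumptions made for illustration.

```python
# Hypothetical reconstruction of the 10 x 5 grid-world described in the table.
# Reward 1 at (10, 5), 0 elsewhere; start at (1, 1); transition noise epsilon = 0.15.
import numpy as np

N_ROWS, N_COLS = 10, 5                                    # 50 states (i, j) in [10] x [5]
ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # left, right, up, down
EPSILON = 0.15                                            # noise used for Figure 1
START, GOAL = (1, 1), (10, 5)                             # 1-indexed, as in the paper

rng = np.random.default_rng(0)

def clip(i, j):
    """Keep the agent inside the grid."""
    return min(max(i, 1), N_ROWS), min(max(j, 1), N_COLS)

def step(state, action):
    """Move in the chosen direction w.p. 1 - epsilon, otherwise in a random direction."""
    i, j = state
    if rng.random() < 1.0 - EPSILON:
        di, dj = ACTIONS[action]
    else:
        di, dj = ACTIONS[int(rng.integers(4))]            # assumed noise model: random neighbor
    next_state = clip(i + di, j + dj)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# Example: a single random-policy episode of length H = 100
state, total_reward = START, 0.0
for _ in range(100):
    state, r = step(state, int(rng.integers(4)))
    total_reward += r
print(total_reward)
```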
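
The exploration bonus quoted in the Experiment Setup row reduces to a one-line function of the visit count n, the stage h, and the horizon H. The sketch below uses the form reconstructed above (the square root and the H − h + 1 terms are inferred from context, so treat the exact expression as an assumption); `bonus` is a hypothetical helper name, not taken from the paper's code.

```python
# Hedged sketch of the shared exploration bonus beta_h^t(s, a) described in the table.
import math

def bonus(n, h, H):
    """Bonus for a state-action pair visited n times at stage h of an episode of horizon H."""
    if n == 0:
        return H - h + 1                                   # unvisited pairs get the maximal bonus
    return min(math.sqrt(1.0 / n) + (H - h + 1) / n, H - h + 1)

# Example: the bonus shrinks as the pair is visited more often (H = 100, h = 1)
print([round(bonus(n, h=1, H=100), 3) for n in (1, 10, 100, 1000)])
```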