From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
Authors: Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide experiments on Bayes-UCBVI and its variants. We illustrate two points: first, that Incr-Bayes-UCBVI performs similarly to other algorithms relying on noise injection for exploration, such as PSRL and RLSVI; second, that Bayes-UCBDQN, the deep-RL extension of Bayes-UCBVI, is competitive with BootDQN. |
| Researcher Affiliation | Collaboration | 1HSE University 2Artificial Intelligence Research Institute 3Duisburg-Essen University 4École Polytechnique 5DeepMind 6Otto von Guericke University. |
| Pseudocode | Yes | Algorithm 1 Bayes-UCBVI |
| Open Source Code | No | No explicit statement about the release of the code for the methodology described in this paper is provided, nor is a direct link to a code repository. |
| Open Datasets | Yes | The testing environments are Atari-57 games, consisting of 57 selected Atari games (Bellemare et al., 2013). |
| Dataset Splits | No | The paper does not explicitly provide details about specific training, validation, and test dataset splits with percentages or sample counts. For deep RL, it mentions 'trained with 200M frames' but not how data was partitioned for validation or testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments. It mentions 'All algorithms are based on the architecture of DQN (Mnih et al., 2013)' but this refers to the software architecture. |
| Software Dependencies | No | The paper mentions software components like 'DQN (Mnih et al., 2013)' and 'RMSProp optimizer (Hinton et al., 2012)' but does not provide specific version numbers for these or other key software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | All algorithms are trained with 200M frames for each environment. The networks are trained with the RMSProp optimizer (Hinton et al., 2012) with learning rate 2.5 × 10⁻⁴. ... The quantile is set at κ = 0.85 in our experiments. ... Bayes-UCBDQN maintains B = 10 copies of the Q-functions... |
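The setup row above mentions a quantile κ = 0.85 and B = 10 Q-function copies. A minimal sketch of the quantile-based action selection this suggests, assuming the deep variant acts greedily with respect to the per-action empirical κ-quantile across the B Q-value estimates (the helper `bayes_ucb_action` is hypothetical, not code from the paper):

```python
import numpy as np

def bayes_ucb_action(q_values, kappa=0.85):
    """Pick the action maximizing the empirical kappa-quantile of the
    Q-value estimates from B Q-function copies.

    q_values: array of shape (B, n_actions), Q(s, a) from each copy.
    Returns the index of the selected action.
    """
    # Per-action upper quantile over the B copies (optimistic estimate).
    upper = np.quantile(q_values, kappa, axis=0)
    return int(np.argmax(upper))

# Toy usage: B = 10 copies, 4 actions, random Q-values.
rng = np.random.default_rng(0)
q = rng.normal(size=(10, 4))
action = bayes_ucb_action(q)
```

Acting on an upper quantile rather than the mean is what makes the estimate optimistic: an action whose value is uncertain across the copies gets a boosted score and is explored more often.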