Variational Bayesian Reinforcement Learning with Regret Bounds

Authors: Brendan O'Donoghue

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we compare the performance of both the temperature-scheduled and optimized-temperature variants of K-learning against several other methods in the literature.
Researcher Affiliation | Industry | Brendan O'Donoghue, DeepMind, UK, bodonoghue@google.com
Pseudocode | Yes | Algorithm 1: K-learning for episodic MDPs
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | We consider a small tabular MDP called Deep Sea [39], shown in Figure 1.
Dataset Splits | No | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] These experiments involved no training on external data.
Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Included in appendix.
Software Dependencies | Yes | SCS: Splitting conic solver, version 2.0.2. https://github.com/cvxgrp/scs, Nov. 2017.
Experiment Setup | Yes | We compare two dithering approaches, Q-learning with epsilon-greedy (ϵ = 0.1) and soft-Q-learning [18] (τ = 0.05), against principled exploration strategies: RLSVI [39], UCBVI [7], optimistic Q-learning (OQL) [23], BEB [24], Thompson sampling [38], and two variants of K-learning, one using the τ_t schedule (10) and the other using the optimal choice τ*_t from solving (11).
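As a minimal illustration of the two dithering baselines quoted in the experiment setup row, the sketch below shows ε-greedy (ε = 0.1) and soft-Q / Boltzmann (τ = 0.05) action selection over a tabular Q-function. The function names and structure are illustrative assumptions, not the paper's implementation, and the K-learning variants themselves are not reproduced here.

```python
import numpy as np

# Sketch (assumed, not the paper's code) of the two dithering baselines:
# epsilon-greedy (epsilon = 0.1) and soft-Q / Boltzmann (tau = 0.05)
# action selection over a tabular Q-function for a single state.

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_q_action(q_values, tau=0.05):
    """Sample an action from the Boltzmann distribution
    pi(a) proportional to exp(Q(a) / tau); small tau is near-greedy."""
    logits = (q_values - np.max(q_values)) / tau  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: Q-values for one state with four actions.
q = np.array([0.1, 0.5, 0.2, 0.4])
print(epsilon_greedy_action(q))  # greedy action 1 most of the time, random otherwise
print(soft_q_action(q))          # samples near-greedily at tau = 0.05
```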