Variational Bayesian Reinforcement Learning with Regret Bounds
Authors: Brendan O'Donoghue
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we compare the performance of both the temperature scheduled and optimized temperature variants of K-learning against several other methods in the literature. |
| Researcher Affiliation | Industry | Brendan O'Donoghue, DeepMind, UK, bodonoghue@google.com |
| Pseudocode | Yes | Algorithm 1 K-learning for episodic MDPs |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] |
| Open Datasets | Yes | We consider a small tabular MDP called Deep Sea [39] shown in Figure 1 |
| Dataset Splits | No | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] These experiments involved no training on external data. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Included in appendix. |
| Software Dependencies | Yes | SCS: Splitting conic solver, version 2.0.2. https://github.com/cvxgrp/scs, Nov. 2017. |
| Experiment Setup | Yes | We compare two dithering approaches, Q-learning with ϵ-greedy (ϵ = 0.1) and soft-Q-learning [18] (τ = 0.05), against principled exploration strategies RLSVI [39], UCBVI [7], optimistic Q-learning (OQL) [23], BEB [24], Thompson sampling [38] and two variants of K-learning, one using the τ_t schedule (10) and the other using the optimal choice τ_t* from solving (11). |
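For orientation, below is a minimal sketch of the two "dithering" baselines named in the experiment setup: ϵ-greedy action selection and soft-Q (Boltzmann) action selection. Only the values ϵ = 0.1 and τ = 0.05 come from the quoted setup; the function names and toy Q-values are illustrative and not from the paper, and this is not the paper's K-learning algorithm.

```python
import numpy as np

EPSILON = 0.1  # epsilon-greedy exploration rate quoted in the setup
TAU = 0.05     # soft-Q-learning temperature quoted in the setup

def epsilon_greedy_action(q_values, rng, epsilon=EPSILON):
    """With probability epsilon take a uniformly random action,
    otherwise take a greedy (argmax) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_q_action(q_values, rng, tau=TAU):
    """Sample an action from the Boltzmann (softmax) distribution
    induced by the Q-values at temperature tau."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Toy usage on illustrative Q-values for a single state.
rng = np.random.default_rng(0)
q = [0.1, 0.5, 0.2]
print(epsilon_greedy_action(q, rng), soft_q_action(q, rng))
```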