Variational Bayesian Reinforcement Learning with Regret Bounds

Authors: Brendan O'Donoghue

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we compare the performance of both the temperature-scheduled and optimized-temperature variants of K-learning against several other methods in the literature.
Researcher Affiliation | Industry | Brendan O'Donoghue, DeepMind, UK, bodonoghue@google.com
Pseudocode | Yes | Algorithm 1: K-learning for episodic MDPs
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | We consider a small tabular MDP called Deep Sea [39], shown in Figure 1.
Dataset Splits | No | Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] These experiments involved no training on external data.
Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Included in appendix.
Software Dependencies | Yes | SCS: Splitting conic solver, version 2.0.2. https://github.com/cvxgrp/scs, Nov. 2017.
Experiment Setup | Yes | We compare two dithering approaches, Q-learning with epsilon-greedy (ϵ = 0.1) and soft-Q-learning [18] (τ = 0.05), against principled exploration strategies: RLSVI [39], UCBVI [7], optimistic Q-learning (OQL) [23], BEB [24], Thompson sampling [38], and two variants of K-learning, one using the τ_t schedule (10) and the other using the optimal choice τ*_t from solving (11).
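As a minimal illustration of the two dithering baselines quoted in the experiment setup row, the sketch below shows ε-greedy (ε = 0.1) and soft-Q / Boltzmann (τ = 0.05) action selection over a tabular Q-function. The function names and structure are illustrative assumptions, not the paper's implementation, and the K-learning variants themselves are not reproduced here.

```python
import numpy as np

# Sketch (assumed, not the paper's code) of the two dithering baselines:
# epsilon-greedy (epsilon = 0.1) and soft-Q / Boltzmann (tau = 0.05)
# action selection over a tabular Q-function for a single state.

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (argmax-Q) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def soft_q_action(q_values, tau=0.05):
    """Sample an action from the Boltzmann distribution
    pi(a) proportional to exp(Q(a) / tau); small tau is near-greedy."""
    logits = (q_values - np.max(q_values)) / tau  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: Q-values for one state with four actions.
q = np.array([0.1, 0.5, 0.2, 0.4])
print(epsilon_greedy_action(q))  # greedy action 1 most of the time, random otherwise
print(soft_q_action(q))          # samples near-greedily at tau = 0.05
```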