Scalar Posterior Sampling with Applications

Authors: Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, Nikos Vlassis

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we compare through simulations the performance of the DS-PSRL algorithm with the latest PSRL algorithm, called Thompson Sampling with dynamic episodes (TSDE) [Ouyang et al., 2017b]. We experimented with the River Swim environment [Strehl and Littman, 2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]."
Researcher Affiliation | Industry | Georgios Theocharous (Adobe Research, theochar@adobe.com); Zheng Wen (Adobe Research, zwen@adobe.com); Yasin Abbasi-Yadkori (Adobe Research, abbasiya@adobe.com); Nikos Vlassis (Netflix, nvlassis@netflix.com)
Pseudocode | Yes | Figure 1: the DS-PSRL algorithm with a deterministic schedule of policy updates. Inputs: P_1, the prior distribution of θ*; L ← 1. For t = 1, 2, ... do: if t = L, sample θ̃_t ~ P_t and set L ← 2L; else θ̃_t ← θ̃_{t-1}. Calculate the near-optimal action a_t for (x_t, θ̃_t). Execute action a_t and observe the new state x_{t+1}. Update P_t with (x_t, a_t, x_{t+1}) to obtain P_{t+1}. End for. (A Python sketch of this loop follows the table.)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | "We experimented with the River Swim environment [Strehl and Littman, 2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]."
Dataset Splits | No | The paper mentions using the River Swim environment and various settings for experiments but does not explicitly specify training, validation, and test dataset splits or percentages.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or memory used for running the experiments.
Software Dependencies | No | The paper does not provide specific software names with version numbers that are required to reproduce the experiments.
Experiment Setup | Yes | "The MDP consists of K states arranged in a chain, with the agent starting in the leftmost state (s = 1). The reward function is given by: r(s, a) = 5 if s = 1 and a = left; r(s, a) = 10000 if s = K and a = right; and r(s, a) = 0 otherwise. We assumed the true model of the world was θ = 2 and that the agent starts in the left-most state. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the River Swim problem and zero otherwise. In our experiment we set n = 2 and d = 2." (See the River Swim environment sketch after the table.)
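
The pseudocode row above corresponds to the following minimal Python sketch of the DS-PSRL loop, assuming hypothetical helpers sample_model, solve_policy, and update_posterior and an environment exposing reset()/step(); only the deterministic doubling schedule (L ← 2L) mirrors the paper's Figure 1, everything else is a placeholder.

```python
# Minimal sketch of the DS-PSRL loop with a deterministic doubling schedule
# of policy updates. The helpers (sample_model, solve_policy, update_posterior)
# and the environment interface are hypothetical placeholders supplied by the
# caller; only the doubling schedule follows the paper's Figure 1.

def ds_psrl(env, prior, horizon, sample_model, solve_policy, update_posterior):
    posterior = prior        # P_1: prior distribution over the scalar parameter
    next_resample = 1        # L: next time step at which to resample the model
    policy = None
    x = env.reset()          # initial state x_1
    for t in range(1, horizon + 1):
        if t == next_resample:
            theta = sample_model(posterior)   # sample theta_t ~ P_t
            policy = solve_policy(theta)      # near-optimal policy for theta_t
            next_resample *= 2                # L <- 2L (doubling schedule)
        # otherwise keep the previously sampled model and its policy
        a = policy(x)                         # near-optimal action a_t
        x_next = env.step(a)                  # execute a_t, observe x_{t+1}
        posterior = update_posterior(posterior, x, a, x_next)  # P_t -> P_{t+1}
        x = x_next
    return posterior
```

The point of the deterministic schedule is that resampling happens at fixed, exponentially spaced times regardless of what the agent observes, so the number of policy switches over T steps is only logarithmic in T.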
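
As a companion to the experiment-setup row, here is a minimal sketch of the River Swim reward function quoted above; the chain length K = 6 and the tabular representation are assumptions for illustration, since the quoted setup fixes only the rewards.

```python
import numpy as np

# Sketch of the River Swim reward table described in the experiment setup:
# K states in a chain, actions 0 = left, 1 = right. K = 6 is an assumed
# chain length; the quoted setup does not fix K.
def river_swim_rewards(K=6):
    LEFT, RIGHT = 0, 1
    r = np.zeros((K, 2))
    r[0, LEFT] = 5            # r(s, a) = 5      if s = 1 and a = left
    r[K - 1, RIGHT] = 10000   # r(s, a) = 10000  if s = K and a = right
    return r                  # all other (s, a) pairs give reward 0

# Example: the agent starts in the leftmost state (s = 1, i.e. index 0).
rewards = river_swim_rewards()
```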