Scalar Posterior Sampling with Applications
Authors: Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, Nikos Vlassis
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we compare through simulations the performance of the DS-PSRL algorithm with the latest PSRL algorithm, called Thompson Sampling with dynamic episodes (TSDE) Ouyang et al. [2017b]. We experimented with the River Swim environment Strehl and Littman [2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]. |
| Researcher Affiliation | Industry | Georgios Theocharous, Adobe Research (theochar@adobe.com); Zheng Wen, Adobe Research (zwen@adobe.com); Yasin Abbasi-Yadkori, Adobe Research (abbasiya@adobe.com); Nikos Vlassis, Netflix (nvlassis@netflix.com) |
| Pseudocode | Yes | Figure 1: The DS-PSRL algorithm with deterministic schedule of policy updates. Inputs: P_1, the prior distribution of θ. L ← 1. for t = 1, 2, ... do: if t = L then sample θ̃_t ~ P_t and set L ← 2L; else θ̃_t ← θ̃_{t-1}; end if; calculate near-optimal action a_t ← μ(x_t, θ̃_t); execute action a_t and observe the new state x_{t+1}; update P_t with (x_t, a_t, x_{t+1}) to obtain P_{t+1}; end for. (See the Python sketch of this loop below the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described. |
| Open Datasets | Yes | We experimented with the River Swim environment Strehl and Littman [2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]. |
| Dataset Splits | No | The paper mentions using the River Swim environment and various settings for experiments but does not explicitly specify training, validation, and test dataset splits or percentages. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or memory used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers that are required to reproduce the experiments. |
| Experiment Setup | Yes | The MDP consists of K states arranged in a chain with the agent starting in the leftmost state (s = 1). The reward function is given by: r(s, a) = 5 if s = 1, a = left; r(s, a) = 10000 if s = K, a = right; and r(s, a) = 0 otherwise. We assumed the true model of the world was θ = 2 and that the agent starts in the left-most state. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the River Swim problem and zero otherwise. In our experiment we set n = 2 and d = 2. (See the RiverSwim reward sketch below the table.) |
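
To make the extracted Figure 1 pseudocode easier to follow, here is a minimal Python sketch of the DS-PSRL loop with its deterministic doubling schedule of policy updates. The `prior.update`, `env.reset`, `env.step`, `sample_model`, and `solve_policy` interfaces are hypothetical stand-ins, since the paper releases no code.

```python
def ds_psrl(prior, sample_model, solve_policy, env, horizon):
    """Sketch of DS-PSRL (Figure 1): resample the model parameter only at
    deterministically scheduled steps t = 1, 2, 4, 8, ..."""
    posterior = prior      # P_1: prior distribution over the parameter theta
    L = 1                  # next time step at which a new theta is sampled
    x = env.reset()        # initial state x_1
    policy = None

    for t in range(1, horizon + 1):
        if t == L:
            theta = sample_model(posterior)   # sample theta_t ~ P_t
            policy = solve_policy(theta)      # near-optimal policy mu(., theta_t)
            L = 2 * L                         # doubling schedule: L <- 2L
        # else: keep theta_{t-1} and its policy unchanged

        a = policy(x)                               # a_t = mu(x_t, theta_t)
        x_next = env.step(a)                        # execute a_t, observe x_{t+1}
        posterior = posterior.update(x, a, x_next)  # P_t -> P_{t+1}
        x = x_next
    return posterior
```

The doubling schedule means the sampled model (and hence the policy) changes only O(log T) times over a horizon of T steps, which is the "deterministic schedule" the algorithm's name refers to.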
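
The experiment-setup row describes the K-state RiverSwim chain and its reward function. The sketch below encodes just that reward, assuming states are numbered 1..K and actions are encoded as 0 = left, 1 = right (an illustrative encoding, not taken from the paper).

```python
LEFT, RIGHT = 0, 1   # illustrative action encoding

def river_swim_reward(s, a, K):
    """RiverSwim reward as described in the setup: 5 for 'left' in the
    leftmost state, 10000 for 'right' in the rightmost state, 0 otherwise."""
    if s == 1 and a == LEFT:
        return 5.0
    if s == K and a == RIGHT:
        return 10000.0
    return 0.0

# Quick check with an illustrative chain length K = 6.
K = 6
assert river_swim_reward(1, LEFT, K) == 5.0
assert river_swim_reward(K, RIGHT, K) == 10000.0
assert river_swim_reward(3, RIGHT, K) == 0.0
```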