Scalar Posterior Sampling with Applications

Authors: Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, Nikos Vlassis

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section we compare through simulations the performance of the DS-PSRL algorithm with the latest PSRL algorithm, called Thompson Sampling with dynamic episodes (TSDE) [Ouyang et al., 2017b]. We experimented with the River Swim environment [Strehl and Littman, 2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]."
Researcher Affiliation | Industry | Georgios Theocharous (Adobe Research, theochar@adobe.com); Zheng Wen (Adobe Research, zwen@adobe.com); Yasin Abbasi-Yadkori (Adobe Research, abbasiya@adobe.com); Nikos Vlassis (Netflix, nvlassis@netflix.com)
Pseudocode | Yes | Figure 1: the DS-PSRL algorithm with a deterministic schedule of policy updates. Inputs: P_1, the prior distribution of θ*; L ← 1. For t = 1, 2, ... do: if t = L, sample θ̃_t ~ P_t and set L ← 2L; else θ̃_t ← θ̃_{t-1}. Calculate the near-optimal action a_t for (x_t, θ̃_t). Execute action a_t and observe the new state x_{t+1}. Update P_t with (x_t, a_t, x_{t+1}) to obtain P_{t+1}. End for. (A Python sketch of this loop follows the table.)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | "We experimented with the River Swim environment [Strehl and Littman, 2008], which was the domain used to show how TSDE outperforms all known existing algorithms in Ouyang et al. [2017b]."
Dataset Splits | No | The paper mentions using the River Swim environment and various settings for experiments but does not explicitly specify training, validation, and test dataset splits or percentages.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or memory used for running the experiments.
Software Dependencies | No | The paper does not provide specific software names with version numbers that are required to reproduce the experiments.
Experiment Setup | Yes | "The MDP consists of K states arranged in a chain, with the agent starting in the leftmost state (s = 1). The reward function is given by: r(s, a) = 5 if s = 1 and a = left; r(s, a) = 10000 if s = K and a = right; and r(s, a) = 0 otherwise. We assumed the true model of the world was θ = 2 and that the agent starts in the left-most state. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the River Swim problem and zero otherwise. In our experiment we set n = 2 and d = 2." (See the River Swim environment sketch after the table.)
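
The pseudocode row above corresponds to the following minimal Python sketch of the DS-PSRL loop, assuming hypothetical helpers sample_model, solve_policy, and update_posterior and an environment exposing reset()/step(); only the deterministic doubling schedule (L ← 2L) mirrors the paper's Figure 1, everything else is a placeholder.

```python
# Minimal sketch of the DS-PSRL loop with a deterministic doubling schedule
# of policy updates. The helpers (sample_model, solve_policy, update_posterior)
# and the environment interface are hypothetical placeholders supplied by the
# caller; only the doubling schedule follows the paper's Figure 1.

def ds_psrl(env, prior, horizon, sample_model, solve_policy, update_posterior):
    posterior = prior        # P_1: prior distribution over the scalar parameter
    next_resample = 1        # L: next time step at which to resample the model
    policy = None
    x = env.reset()          # initial state x_1
    for t in range(1, horizon + 1):
        if t == next_resample:
            theta = sample_model(posterior)   # sample theta_t ~ P_t
            policy = solve_policy(theta)      # near-optimal policy for theta_t
            next_resample *= 2                # L <- 2L (doubling schedule)
        # otherwise keep the previously sampled model and its policy
        a = policy(x)                         # near-optimal action a_t
        x_next = env.step(a)                  # execute a_t, observe x_{t+1}
        posterior = update_posterior(posterior, x, a, x_next)  # P_t -> P_{t+1}
        x = x_next
    return posterior
```

The point of the deterministic schedule is that resampling happens at fixed, exponentially spaced times regardless of what the agent observes, so the number of policy switches over T steps is only logarithmic in T.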
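
As a companion to the experiment-setup row, here is a minimal sketch of the River Swim reward function quoted above; the chain length K = 6 and the tabular representation are assumptions for illustration, since the quoted setup fixes only the rewards.

```python
import numpy as np

# Sketch of the River Swim reward table described in the experiment setup:
# K states in a chain, actions 0 = left, 1 = right. K = 6 is an assumed
# chain length; the quoted setup does not fix K.
def river_swim_rewards(K=6):
    LEFT, RIGHT = 0, 1
    r = np.zeros((K, 2))
    r[0, LEFT] = 5            # r(s, a) = 5      if s = 1 and a = left
    r[K - 1, RIGHT] = 10000   # r(s, a) = 10000  if s = K and a = right
    return r                  # all other (s, a) pairs give reward 0

# Example: the agent starts in the leftmost state (s = 1, i.e. index 0).
rewards = river_swim_rewards()
```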