Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
Authors: Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, Rahul Jain
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results show it to perform better than existing algorithms for infinite horizon MDPs. |
| Researcher Affiliation | Academia | Yi Ouyang University of California, Berkeley ouyangyi@berkeley.edu Mukul Gagrani University of Southern California mgagrani@usc.edu Ashutosh Nayyar University of Southern California ashutosn@usc.edu Rahul Jain University of Southern California rahul.jain@usc.edu |
| Pseudocode | Yes | Algorithm 1 Thompson Sampling with Dynamic Episodes (TSDE) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We consider two environments: randomly generated MDPs and the River Swim example [22]. [22] A. L. Strehl and M. L. Littman, An analysis of model-based interval estimation for markov decision processes, Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309 1331, 2008. |
| Dataset Splits | No | The paper mentions using 'randomly generated MDPs' and the 'River Swim example' and running '500 Monte Carlo runs for both the examples and run for T = 10^5,' but it does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or other system specifications. |
| Software Dependencies | No | The paper mentions algorithms like 'UCRL2 [8], TSMDP [15], and Lazy PSRL [16]' and the use of an 'independent Dirichlet prior,' but it does not specify any software names with version numbers required to replicate the experiments. |
| Experiment Setup | Yes | We chose δ = 0.05 for the implementation of UCRL2 and assume an independent Dirichlet prior with parameters [0.1, . . . , 0.1] over the transition probabilities for all TS algorithms. We simulate 500 Monte Carlo runs for both the examples and run for T = 10^5. The cost function is given by: c(s, a) = 0.8 if s = 1, a = left; c(s, a) = 0 if s = 6, a = right; and c(s, a) = 1 otherwise. |