Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Learning Unknown Markov Decision Processes: A Thompson Sampling Approach
Authors: Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, Rahul Jain
NeurIPS 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results show it to perform better than existing algorithms for in๏ฌnite horizon MDPs. |
| Researcher Affiliation | Academia | Yi Ouyang University of California, Berkeley EMAIL Mukul Gagrani University of Southern California EMAIL Ashutosh Nayyar University of Southern California EMAIL Rahul Jain University of Southern California EMAIL |
| Pseudocode | Yes | Algorithm 1 Thompson Sampling with Dynamic Episodes (TSDE) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We consider two environments: randomly generated MDPs and the River Swim example [22]. [22] A. L. Strehl and M. L. Littman, An analysis of model-based interval estimation for markov decision processes, Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1309 1331, 2008. |
| Dataset Splits | No | The paper mentions using 'randomly generated MDPs' and the 'River Swim example' and running '500 Monte Carlo runs for both the examples and run for T = 10^5,' but it does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or other system specifications. |
| Software Dependencies | No | The paper mentions algorithms like 'UCRL2 [8], TSMDP [15], and Lazy PSRL [16]' and the use of an 'independent Dirichlet prior,' but it does not specify any software names with version numbers required to replicate the experiments. |
| Experiment Setup | Yes | We chose ฮด = 0.05 for the implementation of UCRL2 and assume an independent Dirichlet prior with parameters [0.1, . . . , 0.1] over the transition probabilities for all TS algorithms. We simulate 500 Monte Carlo runs for both the examples and run for T = 10^5. The cost function is given by: c(s, a) = 0.8 if s = 1, a = left; c(s, a) = 0 if s = 6, a = right; and c(s, a) = 1 otherwise. |