Robust Anytime Learning of Markov Decision Processes

Authors: Marnix Suilen, Thiago D. Simão, David Parker, Nils Jansen

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
Researcher Affiliation | Academia | Marnix Suilen (Department of Software Science, Radboud University, Nijmegen, The Netherlands); Thiago D. Simão (Department of Software Science, Radboud University, Nijmegen, The Netherlands); David Parker (Department of Computer Science, University of Oxford, Oxford, United Kingdom); Nils Jansen (Department of Software Science, Radboud University, Nijmegen, The Netherlands)
Pseudocode | No | The paper describes its methods in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The implementation is available at https://github.com/LAVA-LAB/luiaard. See supplementary material for the code and instructions on how to reproduce the results in the paper.
Open Datasets | Yes | We benchmark our method using several well-known environments: the Chain Problem [Araya-López et al., 2011], Aircraft Collision Avoidance [Kochenderfer, 2015], a slippery Grid World [Derman et al., 2019], a 99-armed Bandit [Lattimore and Szepesvári, 2020], and two versions of a Betting Game [Bäuerle and Ott, 2011].
Dataset Splits | No | The paper describes an iterative learning process in which the model is updated with new data and policies are computed on the current learned model, but it does not specify separate training, validation, and test splits.
Hardware Specification | Yes | All experiments were performed on a machine with a 4GHz Intel Core i9 CPU, using a single core.
Software Dependencies | No | The paper states, 'We implement our approach... in Java on top of the verification tool PRISM,' but does not specify version numbers for Java, PRISM, or any other key software dependencies.
Experiment Setup | Yes | We set ε = 1e-4 as a constant and define the prior uMDP with intervals P_i = [ε, 1 − ε] and strength intervals [n_i, n̄_i] = [5, 10] at every transition P(s, a, s_i), as in Figure 2c. For MAP, we use a prior of α_i = 10 for all i. The same prior is used for the point estimates of both PAC and UCRL2, together with an error rate of γ = 0.01. We introduce a hyperparameter ξ ∈ [0, 1], and follow with probability ξ the action of the optimistic policy, and distribute the remaining 1 − ξ uniformly over the other actions, yielding a memoryless randomized policy.
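The experiment setup quote above mentions three concrete ingredients: an interval prior [ε, 1 − ε] on every transition, a MAP point estimate under a Dirichlet prior with α_i = 10, and a memoryless randomized policy that follows the optimistic action with probability ξ. The sketch below is a minimal Python illustration of these ideas, not the authors' Java/PRISM implementation; all function names and the example counts are illustrative assumptions.

```python
# Minimal sketch of the described experiment setup (assumed, not the authors' code).
import numpy as np

EPSILON = 1e-4  # interval prior [epsilon, 1 - epsilon] at every transition


def prior_interval():
    """Prior probability interval assigned to each transition."""
    return (EPSILON, 1.0 - EPSILON)


def map_estimate(counts, alpha=10.0):
    """MAP estimate of a categorical distribution under a symmetric
    Dirichlet(alpha) prior, given observed transition counts
    (mode of the Dirichlet posterior, valid for alpha > 1)."""
    counts = np.asarray(counts, dtype=float)
    k = len(counts)
    return (counts + alpha - 1.0) / (counts.sum() + k * (alpha - 1.0))


def mixed_exploration_policy(optimistic_action, num_actions, xi=0.9):
    """Memoryless randomized policy: follow the optimistic action with
    probability xi and spread the remaining 1 - xi uniformly over the
    other actions."""
    probs = np.full(num_actions, (1.0 - xi) / (num_actions - 1))
    probs[optimistic_action] = xi
    return probs


if __name__ == "__main__":
    # Hypothetical counts for a state-action pair with three successor states.
    print(prior_interval())                        # (0.0001, 0.9999)
    print(map_estimate([7, 2, 1]))                 # MAP transition probabilities
    print(mixed_exploration_policy(0, 4, xi=0.9))  # [0.9, 0.033..., 0.033..., 0.033...]
```

The value ξ = 0.9 in the usage example is an arbitrary placeholder; the paper treats ξ as a hyperparameter in [0, 1] rather than fixing a single value here.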