Robust Anytime Learning of Markov Decision Processes

Authors: Marnix Suilen, Thiago D. Simão, David Parker, Nils Jansen

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
Researcher Affiliation | Academia | Marnix Suilen (Department of Software Science, Radboud University, Nijmegen, The Netherlands); Thiago D. Simão (Department of Software Science, Radboud University, Nijmegen, The Netherlands); David Parker (Department of Computer Science, University of Oxford, Oxford, United Kingdom); Nils Jansen (Department of Software Science, Radboud University, Nijmegen, The Netherlands)
Pseudocode | No | The paper describes its methods in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The implementation is available at https://github.com/LAVA-LAB/luiaard. See supplementary material for the code and instructions on how to reproduce the results in the paper.
Open Datasets | Yes | We benchmark our method using several well-known environments: the Chain Problem [Araya-López et al., 2011], Aircraft Collision Avoidance [Kochenderfer, 2015], a slippery Grid World [Derman et al., 2019], a 99-armed Bandit [Lattimore and Szepesvári, 2020], and two versions of a Betting Game [Bäuerle and Ott, 2011].
Dataset Splits | No | The paper describes an iterative learning process in which the model is updated with new data and policies are computed on the current learned model, but it does not specify separate training, validation, and test splits.
Hardware Specification | Yes | All experiments were performed on a machine with a 4GHz Intel Core i9 CPU, using a single core.
Software Dependencies | No | The paper states, 'We implement our approach... in Java on top of the verification tool PRISM,' but does not specify version numbers for Java, PRISM, or any other key software dependencies.
Experiment Setup | Yes | We set ε = 1e-4 as a constant and define the prior uMDP with intervals P_i = [ε, 1 − ε] and strength intervals [n_i, n̄_i] = [5, 10] at every transition P(s, a, s_i), as in Figure 2c. For MAP, we use a prior of α_i = 10 for all i. The same prior is used for the point estimates of both PAC and UCRL2, together with an error rate of γ = 0.01. We introduce a hyperparameter ξ ∈ [0, 1], and follow with probability ξ the action of the optimistic policy, and distribute the remaining 1 − ξ uniformly over the other actions, yielding a memoryless randomized policy.
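The experiment setup quote above mentions three concrete ingredients: an interval prior [ε, 1 − ε] on every transition, a MAP point estimate under a Dirichlet prior with α_i = 10, and a memoryless randomized policy that follows the optimistic action with probability ξ. The sketch below is a minimal Python illustration of these ideas, not the authors' Java/PRISM implementation; all function names and the example counts are illustrative assumptions.

```python
# Minimal sketch of the described experiment setup (assumed, not the authors' code).
import numpy as np

EPSILON = 1e-4  # interval prior [epsilon, 1 - epsilon] at every transition


def prior_interval():
    """Prior probability interval assigned to each transition."""
    return (EPSILON, 1.0 - EPSILON)


def map_estimate(counts, alpha=10.0):
    """MAP estimate of a categorical distribution under a symmetric
    Dirichlet(alpha) prior, given observed transition counts
    (mode of the Dirichlet posterior, valid for alpha > 1)."""
    counts = np.asarray(counts, dtype=float)
    k = len(counts)
    return (counts + alpha - 1.0) / (counts.sum() + k * (alpha - 1.0))


def mixed_exploration_policy(optimistic_action, num_actions, xi=0.9):
    """Memoryless randomized policy: follow the optimistic action with
    probability xi and spread the remaining 1 - xi uniformly over the
    other actions."""
    probs = np.full(num_actions, (1.0 - xi) / (num_actions - 1))
    probs[optimistic_action] = xi
    return probs


if __name__ == "__main__":
    # Hypothetical counts for a state-action pair with three successor states.
    print(prior_interval())                        # (0.0001, 0.9999)
    print(map_estimate([7, 2, 1]))                 # MAP transition probabilities
    print(mixed_exploration_policy(0, 4, xi=0.9))  # [0.9, 0.033..., 0.033..., 0.033...]
```

The value ξ = 0.9 in the usage example is an arbitrary placeholder; the paper treats ξ as a hyperparameter in [0, 1] rather than fixing a single value here.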