Robust Anytime Learning of Markov Decision Processes
Authors: Marnix Suilen, Thiago D. Simão, David Parker, Nils Jansen
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks. |
| Researcher Affiliation | Academia | Marnix Suilen, Department of Software Science, Radboud University, Nijmegen, The Netherlands; Thiago D. Simão, Department of Software Science, Radboud University, Nijmegen, The Netherlands; David Parker, Department of Computer Science, University of Oxford, Oxford, United Kingdom; Nils Jansen, Department of Software Science, Radboud University, Nijmegen, The Netherlands |
| Pseudocode | No | The paper describes its methods in prose and mathematical formulations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation is available at https://github.com/LAVA-LAB/luiaard. See supplementary material for the code and instructions on how to reproduce the results in the paper. |
| Open Datasets | Yes | We benchmark our method using several well-known environments: the Chain Problem [Araya-López et al., 2011], Aircraft Collision Avoidance [Kochenderfer, 2015], a slippery Grid World [Derman et al., 2019], a 99-armed Bandit [Lattimore and Szepesvári, 2020], and two versions of a Betting Game [Bäuerle and Ott, 2011]. |
| Dataset Splits | No | The paper describes an iterative learning process where the model is updated with new data and policies are computed on the current learned model, but it does not specify a separate 'validation' dataset split in the traditional sense of training, validation, and test sets. |
| Hardware Specification | Yes | All experiments were performed on a machine with a 4GHz Intel Core i9 CPU, using a single core. |
| Software Dependencies | No | The paper states, 'We implement our approach... in Java on top of the verification tool PRISM,' but does not specify version numbers for Java or PRISM, nor any other key software dependencies with their versions. |
| Experiment Setup | Yes | We set ε = 1e-4 as constant and define the prior uMDP with intervals P_i = [ε, 1 − ε] and strength intervals [n_i, n̄_i] = [5, 10] at every transition P(s, a, s_i), as in Figure 2c. For MAP, we use a prior of α_i = 10 for all i. The same prior is used for the point estimates of both PAC and UCRL2, together with an error rate of γ = 0.01. We introduce a hyperparameter ξ ∈ [0, 1], and follow with probability ξ the action of the optimistic policy, and distribute the remaining 1 − ξ uniformly over the other actions, yielding a memoryless randomized policy. (See the sketch after this table.) |
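The experiment setup above fixes a prior uMDP with probability intervals [ε, 1 − ε], interval strengths [5, 10], a MAP prior of α_i = 10, and a ξ-randomized exploration policy. The following is a minimal, hedged sketch of that configuration, not the authors' implementation: the class, field, and method names are hypothetical, and only the constants quoted in the table are taken from the paper.

```java
import java.util.Random;

/**
 * Hedged sketch (not the authors' PRISM-based code) of the quoted setup:
 * prior uMDP intervals [eps, 1 - eps] with strengths [5, 10], MAP prior
 * alpha_i = 10, error rate gamma = 0.01, and a xi-randomized exploration
 * policy that follows the optimistic action with probability xi and
 * otherwise picks one of the remaining actions uniformly at random.
 */
public class ExplorationSketch {

    static final double EPSILON = 1e-4;                          // from the paper
    static final double[] PRIOR_INTERVAL = {EPSILON, 1.0 - EPSILON};
    static final int[] PRIOR_STRENGTH = {5, 10};                  // [n_i, n_bar_i]
    static final double MAP_ALPHA = 10.0;                         // alpha_i for MAP estimation
    static final double ERROR_RATE = 0.01;                        // gamma for PAC / UCRL2

    private final Random rng = new Random();

    /**
     * Memoryless randomized policy: with probability xi return the
     * optimistic policy's action, otherwise a different action chosen
     * uniformly among the remaining ones.
     */
    int sampleAction(int optimisticAction, int numActions, double xi) {
        if (numActions == 1 || rng.nextDouble() < xi) {
            return optimisticAction;
        }
        int other = rng.nextInt(numActions - 1);
        return other < optimisticAction ? other : other + 1;      // skip the optimistic action
    }

    public static void main(String[] args) {
        ExplorationSketch sketch = new ExplorationSketch();
        // xi = 0.8 and 4 actions are illustrative values, not taken from the paper.
        int action = sketch.sampleAction(2, 4, 0.8);
        System.out.println("sampled action: " + action);
    }
}
```

The sketch only shows how the ξ-randomization over actions could be realized; the robust policy computation on the learned uMDP itself is performed by the authors' tooling on top of PRISM.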