Solving Long-run Average Reward Robust MDPs via Stochastic Games

Authors: Krishnendu Chatterjee, Ehsan Kafshdar Goharshady, Mehrdad Karrabi, Petr Novotný, Đorđe Žikelić

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement RPPI (Algorithm 1) and compare it against two state-of-the-art value iteration-based methods for solving long-run average reward RMDPs with uncertainty sets not being intervals or L1-balls, towards demonstrating the significant computational runtime gains provided by a policy iteration-based algorithm. Furthermore, we demonstrate the applicability of our method to non-unichain polytopic RMDPs to which existing algorithms are not applicable. Our implementation is publicly available at https://github.com/mehrdad76/RMDP-LRA.
Researcher Affiliation | Academia | 1) Institute of Science and Technology Austria (ISTA), Austria; 2) Masaryk University, Czech Republic; 3) Singapore Management University, Singapore
Pseudocode | Yes | Algorithm 1: Robust Polytopic Policy Iteration (RPPI)
Open Source Code | Yes | Our implementation is publicly available at https://github.com/mehrdad76/RMDP-LRA.
Open Datasets | Yes | We consider a Contamination Model taken from [Wang et al., 2023]. ... This benchmark modifies the Frozen Lake environment in OpenAI Gym [Towers et al., 2023] to turn it into an RMDP. (A sketch of reading the nominal Frozen Lake kernel appears after this table.)
Dataset Splits | No | The paper describes the benchmarks and models used but does not specify training, validation, or test dataset splits (e.g., by percentages or counts).
Hardware Specification | Yes | All experiments were run in Python 3.9 on an Ubuntu 22.04 machine with an octa-core 2.40 GHz Intel Core i5 CPU, 16 GB RAM.
Software Dependencies | No | The paper mentions 'Python 3.9' and using 'Storm' (a probabilistic model checker). While Python 3.9 specifies a version, the paper does not provide version numbers for any libraries, frameworks, or for Storm itself, which would be required for comprehensive reproducibility.
Experiment Setup | Yes | In our comparison, we run them both until the absolute difference between their computed values and our computed value is at most 10^-3. ... In our evaluation, we consider a Contamination Model taken from [Wang et al., 2023]. ... We turn this model into a polytopic RMDP by allowing the adversarial environment to perturb the transition probabilities by increasing the probability of moving to one adjacent cell by at most d = 0.2 ... Due to numerical precision issues in Python and Storm, in PPE we replace the equality by |max_rewards - min_rewards| < 10^-5. (A minimal sketch of these tolerance checks appears after this table.)
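
The Frozen Lake benchmark above starts from a nominal Gym transition kernel that is then perturbed adversarially. Below is a minimal sketch, not the authors' implementation, of how that nominal kernel can be read from the `gymnasium` package (the maintained Gym successor cited as [Towers et al., 2023]) as a starting point for building polytopic uncertainty sets; the perturbation budget `d = 0.2` is taken from the description above, while the variable names and the aggregation step are illustrative assumptions.

```python
# Minimal sketch: read the nominal FrozenLake transition kernel from Gymnasium
# and pair each (state, action) with a perturbation budget d, as a starting
# point for a polytopic RMDP. Illustrative only; not the authors' code.
import gymnasium as gym

d = 0.2  # maximum probability mass the adversary may shift (value quoted above)

env = gym.make("FrozenLake-v1", is_slippery=True)
P = env.unwrapped.P  # P[state][action] -> list of (prob, next_state, reward, terminated)

nominal = {}
for s, actions in P.items():
    for a, transitions in actions.items():
        # Aggregate duplicate successor states into one nominal distribution.
        dist = {}
        for prob, s_next, _reward, _done in transitions:
            dist[s_next] = dist.get(s_next, 0.0) + prob
        nominal[(s, a)] = dist

# Each uncertainty set would then be the polytope of distributions obtained by
# moving at most d probability mass from the nominal distribution toward one
# adjacent cell, as described in the benchmark above.
print(f"{len(nominal)} state-action pairs with perturbation budget d = {d}")
```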
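The Experiment Setup row quotes two numerical tolerances: the value iteration baselines are run until their computed value comes within 10^-3 of the value computed by RPPI, and the exact equality test in PPE is relaxed to |max_rewards - min_rewards| < 10^-5. The sketch below shows how such checks might look; `vi_step` and the reward bounds are hypothetical placeholders, and only the two tolerances come from the paper.

```python
# Minimal sketch of the two numerical tolerances quoted above. `vi_step` and
# the reward bounds are hypothetical placeholders; only the tolerances
# (1e-3 and 1e-5) come from the experiment description.
from typing import Callable

COMPARISON_TOL = 1e-3  # stop a VI baseline once it is this close to RPPI's value
PPE_TOL = 1e-5         # relaxed equality test used in PPE due to numerical precision


def run_vi_until_close(vi_step: Callable[[], float], rppi_value: float,
                       max_iters: int = 1_000_000) -> float:
    """Run a value-iteration baseline until its value is within COMPARISON_TOL
    of the value computed by RPPI (the comparison protocol quoted above)."""
    value = float("inf")
    for _ in range(max_iters):
        value = vi_step()
        if abs(value - rppi_value) <= COMPARISON_TOL:
            break
    return value


def ppe_converged(max_rewards: float, min_rewards: float) -> bool:
    """Relaxed equality check used in place of exact equality in PPE."""
    return abs(max_rewards - min_rewards) < PPE_TOL
```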