Solving Long-run Average Reward Robust MDPs via Stochastic Games
Authors: Krishnendu Chatterjee, Ehsan Kafshdar Goharshady, Mehrdad Karrabi, Petr Novotný, Đorđe Žikelić
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement RPPI (Algorithm 1) and compare it against two state-of-the-art value iteration-based methods for solving long-run average reward RMDPs with uncertainty sets not being intervals or L1-balls, towards demonstrating the significant computational runtime gains provided by a policy iteration-based algorithm. Furthermore, we demonstrate the applicability of our method to non-unichain polytopic RMDPs to which existing algorithms are not applicable. Our implementation is publicly available at https://github.com/mehrdad76/RMDP-LRA. |
| Researcher Affiliation | Academia | 1Institute of Science and Technology Austria (ISTA), Austria 2Masaryk University, Czech Republic 3Singapore Management University, Singapore |
| Pseudocode | Yes | Algorithm 1 Robust Polytopic Policy Iteration (RPPI) |
| Open Source Code | Yes | Our implementation is publicly available at https://github.com/mehrdad76/RMDP-LRA. |
| Open Datasets | Yes | We consider a Contamination Model taken from [Wang et al., 2023]. ... This benchmark modifies the Frozen Lake environment in the OpenAI Gym [Towers et al., 2023] to turn it into an RMDP. |
| Dataset Splits | No | The paper describes the benchmarks and models used but does not specify training, validation, or test dataset splits (e.g., by percentages or counts). |
| Hardware Specification | Yes | All experiments were run in Python 3.9 on a Ubuntu 22.04 machine with an octa-core 2.40 GHz Intel Core i5 CPU, 16 GB RAM. |
| Software Dependencies | No | The paper mentions 'Python 3.9' and using 'Storm' (a probabilistic model checker). While the Python version is given, no version numbers are provided for any libraries, frameworks, or for Storm itself, which would be needed for comprehensive reproducibility. |
| Experiment Setup | Yes | In our comparison, we run them both until the absolute difference between their computed values and our computed value is at most 10⁻³. ... In our evaluation, we consider a Contamination Model taken from [Wang et al., 2023]. ... We turn this model into a polytopic RMDP by allowing the adversarial environment to perturb the transition probabilities by increasing the probability of moving to one adjacent cell by at most d = 0.2 ... Due to numerical precision issues in Python and Storm, in PPE we replace the equality by |max_rewards − min_rewards| < 10⁻⁵. |
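
The Pseudocode row above names Algorithm 1, Robust Polytopic Policy Iteration (RPPI); the algorithm itself is given in the paper. Purely for intuition, the sketch below shows the standard long-run average reward (gain) computation for a unichain Markov chain, which is the kind of evaluation a policy-iteration scheme performs once a policy and one vertex of the polytopic uncertainty set are fixed. This is not the authors' RPPI, and the function name is hypothetical.

```python
# Illustrative sketch (NOT the authors' RPPI): the long-run average
# reward (gain) of a unichain Markov chain, i.e. the quantity a
# policy-iteration scheme evaluates after fixing a policy and one
# vertex of the polytopic uncertainty set.
import numpy as np

def long_run_average_reward(P, r):
    """Gain of a unichain Markov chain.

    P : (n, n) row-stochastic transition matrix
    r : (n,) per-state rewards

    Solves pi = pi P with sum(pi) = 1 for the stationary distribution,
    then returns the expected reward under pi.
    """
    n = P.shape[0]
    # Stack the stationarity equations (P^T - I) pi = 0 with the
    # normalization constraint sum(pi) = 1, and solve in least squares.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ r)

# Tiny example: a two-state chain with stationary distribution (5/6, 1/6).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
print(long_run_average_reward(P, r))  # ≈ 0.8333
```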
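
The Experiment Setup row quotes two numerical tolerances: the value iteration baselines are run until their computed values are within 10⁻³ of RPPI's, and the exact-equality test in PPE is relaxed to a 10⁻⁵ threshold to cope with floating-point precision in Python and Storm. A minimal sketch of how such checks could look, with all names (`run_vi_until_close`, `rewards_converged`, ...) hypothetical rather than taken from the public repository:

```python
# Illustrative only: the two tolerances quoted above, under assumed names.
VI_TOL = 1e-3   # stop a VI baseline once within 1e-3 of RPPI's value
PPE_TOL = 1e-5  # surrogate for the exact-equality check in PPE

def run_vi_until_close(vi_step, v0, rppi_value, max_iters=1_000_000):
    """Iterate a value-iteration-style update until its scalar value
    is within VI_TOL of the value computed by RPPI."""
    v = v0
    for _ in range(max_iters):
        if abs(v - rppi_value) <= VI_TOL:
            break
        v = vi_step(v)
    return v

def rewards_converged(max_rewards, min_rewards):
    """PPE termination test with exact equality relaxed to a tolerance."""
    return abs(max_rewards - min_rewards) < PPE_TOL
```

Relaxing the equality this way trades exactness for robustness to floating-point error, which matches the paper's stated reason for the 10⁻⁵ threshold.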