Fast Bellman Updates for Wasserstein Distributionally Robust MDPs
Authors: Zhuodong Yu, Ling Dai, Shaohang Xu, Siyang Gao, Chin Pang Ho
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our numerical experiments demonstrate that the proposed algorithms outperform other state-of-the-art solution methods. |
| Researcher Affiliation | Academia | 1School of Data Science, City University of Hong Kong, Hong Kong 2Department of Systems Engineering, City University of Hong Kong, Hong Kong |
| Pseudocode | Yes | Algorithm 1: Nested bisection method to compute distributionally robust Bellman update (5) |
| Open Source Code | Yes | We will release our code to ensure reproducibility. https://github.com/Chill-zd/Fast-Bellman-Updates-DRMDP |
| Open Datasets | Yes | Our algorithms are tested on some random instances generated by the Generalized Average Reward Non-stationary Environment Test-bench (Garnet MDPs) (Archibald et al., 1995; Bhatnagar et al., 2009). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits needed to reproduce the experiment. |
| Hardware Specification | Yes | All experiments are implemented in Python 3.8, and they are run on a 2.3 GHz 4-Core Intel Core i7 CPU with 32 GB 3733 MHz DDR4 main memory. |
| Software Dependencies | Yes | All experiments are implemented in Python 3.8. We compare our fast algorithm with the state-of-the-art solver Gurobi, version v10.0.1rc0 (Gurobi Optimization, LLC, 2023), and the first-order method of (Grand-Clément and Kroer, 2021a). |
| Experiment Setup | Yes | Following the same setting as (Grand-Clément and Kroer, 2021a), we set n_b to 0.2 and draw uniform random rewards in [0, 10]. The discount factor λ is fixed at 0.8, and the parameter ϵ = 0.1. For each MDP instance, we generate the sampled kernels p_1, …, p_N as N small random (Garnet) perturbations around the nominal kernel p_0. We set the parameter θ in Proposition 4.1 to n_b·A. To test the speed of the Bellman update, we run the random instances 50 times for all algorithms and report the average times in Tables 1, 2, and 3. We also compare the speed of value iteration for all algorithms using the same convergence criterion ‖v − v′‖ ≤ 2λϵ(1 − λ)⁻¹, which follows (Grand-Clément and Kroer, 2021a). |
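
For orientation, below is a minimal Python sketch of how the quoted setup might look: a Garnet nominal kernel, N small perturbed kernels, and value iteration with the stopping rule ‖v − v′‖ ≤ 2λϵ(1 − λ)⁻¹. Everything here is an assumption rather than the authors' released code: the function names `garnet_kernel`, `perturbed_kernels`, and `value_iteration`, the perturbation magnitude `scale`, and the choice of sup-norm in the stopping test are all hypothetical, and the Bellman update shown is the plain (non-robust) one.

```python
import numpy as np

def garnet_kernel(S, A, n_b, rng):
    """Hypothetical Garnet nominal kernel p_0: for each (s, a), spread
    probability mass over ceil(n_b * S) uniformly chosen successor states."""
    branching = max(1, int(np.ceil(n_b * S)))
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            succ = rng.choice(S, size=branching, replace=False)
            cuts = np.sort(rng.uniform(size=branching - 1))
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    return P

def perturbed_kernels(P0, N, scale, rng):
    """N small random perturbations around the nominal kernel p_0,
    re-normalised onto the simplex. The magnitude `scale` is an assumption;
    the paper only says the perturbations are small."""
    samples = []
    for _ in range(N):
        P = P0 + rng.uniform(0.0, scale, size=P0.shape)
        P /= P.sum(axis=-1, keepdims=True)
        samples.append(P)
    return samples

def value_iteration(P, r, lam=0.8, eps=0.1):
    """Plain (non-robust) value iteration, stopping when
    ||v - v'|| <= 2*lam*eps*(1 - lam)^(-1); sup-norm assumed."""
    S, A, _ = P.shape
    v = np.zeros(S)
    tol = 2.0 * lam * eps / (1.0 - lam)
    while True:
        q = r + lam * (P @ v)        # (S, A) action values
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) <= tol:
            return v_new
        v = v_new

rng = np.random.default_rng(0)
S, A, N = 10, 5, 10
P0 = garnet_kernel(S, A, n_b=0.2, rng=rng)
kernels = perturbed_kernels(P0, N=N, scale=0.05, rng=rng)
r = rng.uniform(0.0, 10.0, size=(S, A))   # uniform random rewards in [0, 10]
v = value_iteration(P0, r)                # nominal (non-robust) value function
```

The distributionally robust variant would replace the plain `max` over action values with an inner minimisation over a Wasserstein ball built from the sampled kernels; that inner problem is exactly the Bellman update (5) that the paper's nested bisection method (Algorithm 1) is designed to compute quickly.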