Fast Bellman Updates for Wasserstein Distributionally Robust MDPs

Authors: Zhuodong Yu, Ling Dai, Shaohang Xu, Siyang Gao, Chin Pang Ho

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments demonstrate that the proposed algorithms outperform other state-of-the-art solution methods.
Researcher Affiliation | Academia | (1) School of Data Science, City University of Hong Kong, Hong Kong; (2) Department of Systems Engineering, City University of Hong Kong, Hong Kong
Pseudocode | Yes | Algorithm 1: Nested bisection method to compute the distributionally robust Bellman update (5). (A generic nested-bisection sketch follows this table.)
Open Source Code | Yes | We will release our code to ensure reproducibility. https://github.com/Chill-zd/Fast-Bellman-Updates-DRMDP
Open Datasets | Yes | Our algorithms are tested on random instances generated by the Generalized Average Reward Non-stationary Environment Test-bench (Garnet MDPs) (Archibald et al., 1995; Bhatnagar et al., 2009). (An illustrative Garnet-style generator is sketched after this table.)
Dataset Splits | No | The paper does not explicitly provide the training/validation/test dataset splits needed to reproduce the experiments.
Hardware Specification | Yes | All experiments are implemented in Python 3.8 and run on a 2.3 GHz 4-core Intel Core i7 CPU with 32 GB of 3733 MHz DDR4 main memory.
Software Dependencies | Yes | All experiments are implemented in Python 3.8. We compare our fast algorithm with the state-of-the-art solver Gurobi, version v10.0.1rc0 (Gurobi Optimization, LLC, 2023), and the first-order method of Grand-Clément and Kroer (2021a).
Experiment Setup | Yes | Following the same setting as Grand-Clément and Kroer (2021a), we set n_b to 0.2 and draw random uniform rewards in [0, 10]. The discount factor λ is fixed at 0.8, and the parameter ϵ = 0.1. For each MDP instance, we generate the sampled kernels p^1, ..., p^N as N small random (Garnet) perturbations around the nominal kernel p^0. We set the parameter θ in Proposition 4.1 to n_b A. To test the speed of the Bellman update, we run the random instances 50 times for all algorithms and report the average times in Tables 1, 2, and 3. We also compare the speed of value iteration for all algorithms using the same convergence criterion, ‖v − v′‖ ≤ 2λϵ(1 − λ)⁻¹, which follows Grand-Clément and Kroer (2021a). (A value-iteration sketch with this stopping rule follows this table.)
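
The pseudocode row names a nested bisection method. As a rough, generic illustration of what "nested bisection" means (an outer bisection whose objective is itself evaluated by an inner bisection), here is a minimal Python sketch on a toy root-finding problem; it is not the paper's Algorithm 1, and the functions `inner_root` and `outer_root`, their bracketing intervals, and the tolerances are purely illustrative assumptions.

```python
def bisect(f, lo, hi, tol=1e-8, max_iter=100):
    """Plain bisection: find x in [lo, hi] with f(x) ~ 0,
    assuming f is increasing with f(lo) <= 0 <= f(hi)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def inner_root(x):
    # Inner bisection: the unique y solving y^3 + y - x = 0.
    return bisect(lambda y: y**3 + y - x, -10.0, 10.0)

def outer_root(target):
    # Outer bisection: the x whose inner solution equals `target`,
    # i.e. each outer function evaluation runs a full inner bisection.
    return bisect(lambda x: inner_root(x) - target, -1e3, 1e3)

print(outer_root(1.5))  # analytically x = 1.5**3 + 1.5 = 4.875
```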
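
The Garnet MDPs referenced in the datasets row can be generated compactly: for each state-action pair, probability mass is placed on a small random subset of next states, and rewards are drawn uniformly. The sketch below is a minimal Garnet-style generator; the parameter names (`n_states`, `n_actions`, `branching`, `r_max`) and the perturbation helper are illustrative assumptions consistent with the quoted setup (branching fraction 0.2, rewards in [0, 10]), not the paper's exact code.

```python
import numpy as np

def garnet_mdp(n_states, n_actions, branching=0.2, r_max=10.0, seed=0):
    """Garnet-style instance: for each (s, a), spread probability mass over
    a small random subset of next states; rewards uniform in [0, r_max]."""
    rng = np.random.default_rng(seed)
    n_branch = max(1, int(round(branching * n_states)))
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            support = rng.choice(n_states, size=n_branch, replace=False)
            P[s, a, support] = rng.dirichlet(np.ones(n_branch))
    R = rng.uniform(0.0, r_max, size=(n_states, n_actions))
    return P, R

def perturbed_kernels(P0, n_kernels, scale=0.05, seed=1):
    """Sample kernels p^1, ..., p^N as small random perturbations of the
    nominal kernel p^0 (illustrative; not the paper's exact procedure)."""
    rng = np.random.default_rng(seed)
    kernels = []
    for _ in range(n_kernels):
        P = P0 + scale * rng.uniform(size=P0.shape) * (P0 > 0)
        kernels.append(P / P.sum(axis=-1, keepdims=True))
    return kernels
```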
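
The setup row quotes a value-iteration stopping rule, ‖v − v′‖ ≤ 2λϵ(1 − λ)⁻¹. The sketch below applies that rule around an arbitrary Bellman operator; the nominal (non-robust) operator used here is only a stand-in for the paper's distributionally robust update, and λ = 0.8, ϵ = 0.1 follow the quoted setup.

```python
import numpy as np

def value_iteration(bellman_update, v0, lam=0.8, eps=0.1, max_iter=10_000):
    """Value iteration with the quoted stopping rule:
    stop once ||v_new - v||_inf <= 2 * lam * eps / (1 - lam)."""
    tol = 2.0 * lam * eps / (1.0 - lam)
    v = np.asarray(v0, dtype=float)
    for _ in range(max_iter):
        v_new = bellman_update(v)
        if np.max(np.abs(v_new - v)) <= tol:
            return v_new
        v = v_new
    return v

def nominal_bellman(P, R, lam=0.8):
    # Stand-in (non-robust) Bellman operator for a nominal kernel P and reward R;
    # the paper's distributionally robust update would be plugged in here instead.
    return lambda v: np.max(R + lam * P @ v, axis=1)

# Example, using the Garnet generator sketched above:
# P, R = garnet_mdp(n_states=30, n_actions=5)
# v = value_iteration(nominal_bellman(P, R), v0=np.zeros(30))
```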