Accelerated Policy Gradient for s-rectangular Robust MDPs with Large State Spaces
Authors: Ziyi Chen, Heng Huang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments are implemented in Python 3.9 on a MacBook Pro laptop with 500 GB storage and an 8-core CPU (16 GB memory). The code can be downloaded from https://github.com/changy12/ICML2024-Accelerated-Policy-Gradient-for-s-rectangular-Robust-MDPs-with-Large-State-Spaces. O.1. Experiments on Small State Space under Deterministic Setting; O.2. Experiments on Large State Space |
| Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park. Correspondence to: Ziyi Chen <zc286@umd.edu>, Heng Huang <heng@umd.edu>. |
| Pseudocode | Yes | Algorithm 1 Accelerated Robust Policy Gradient; Algorithm 2 Accelerated Stochastic Robust Policy Gradient; Algorithm 3 Accelerated Stochastic Robust Policy Gradient for Large State Space |
| Open Source Code | Yes | The code can be downloaded from https://github.com/changy12/ICML2024-Accelerated-Policy-Gradient-for-s-rectangular-Robust-MDPs-with-Large-State-Spaces. |
| Open Datasets | Yes | We compare our Algorithm 1 with the existing double-loop robust policy gradient (DRPG) algorithm (Wang et al., 2023) and actor-critic algorithm (Li et al., 2023b) under the deterministic setting (i.e., when exact values of some quantities are available, including gradients, Q functions, V functions, etc.) on the Garnet problem (Archibald et al., 1995; Wang and Zou, 2022) with state space S = {0, 1, 2, 3, 4} of 5 states and action space A = {0, 1, 2} of 3 actions. (A minimal Garnet-generation sketch is given after the table.) |
| Dataset Splits | No | The paper describes problem setups (Garnet problem with different state spaces) and experimental parameters but does not specify train/validation/test dataset splits for reproducibility. |
| Hardware Specification | Yes | The experiments are implemented in Python 3.9 on a MacBook Pro laptop with 500 GB storage and an 8-core CPU (16 GB memory). |
| Software Dependencies | No | The paper mentions "Python 3.9" but does not specify versions for other key software libraries or dependencies. |
| Experiment Setup | Yes | We implement an exact version of Algorithm 1 (i.e., ϵ1 = ϵ2 = 0) using Tp = 5 outer transition kernel updates with stepsize β = 0.001, and T = 1 inner policy update with stepsize η = (1 − γ)/τ = 50 per outer update. For the DRPG algorithm, we use T = 5 outer policy updates (Algorithm 1 of (Wang et al., 2023)) with stepsize αt = 10 and Tk = 1 inner transition kernel update (Algorithm 2 of (Wang et al., 2023)) with stepsize βt = 0.001 per outer update. For the actor-critic algorithm (Algorithm 4.1 of (Li et al., 2023b)), we use K = 5 outer iterations, where the actor step (policy update) uses stepsize η = 500, and the critic step (transition kernel update) uses only 1 iteration of Algorithm 3.2 of (Li et al., 2023b) with αm = 1 as well as Pϵ obtained by exactly solving the direction-finding subproblem in eq. (3.4) of (Li et al., 2023b). In the above robust MDP setting with varying d ∈ {5, 20, 50, 100, 130, 140, 150}, we implement Algorithm 3 with τ = 0.1, T = 10, T = 20, T1 = 10^5, η = 1, β = 0.001, α = 0.001, N = 1, H = 500, N = 10^4. (A sketch of the nested update structure for the exact Algorithm 1 setting follows the table.) |
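
The Garnet benchmark cited in the Open Datasets row (Archibald et al., 1995) is a randomly generated tabular MDP. The following is a minimal generation sketch, assuming a branching factor of 3, Dirichlet-sampled transition probabilities, and uniform rewards in [0, 1]; the helper name `make_garnet` and these choices are illustrative assumptions, not the authors' released code.

```python
# Minimal Garnet MDP sketch (hypothetical helper; branching factor, Dirichlet
# sampling, and uniform rewards are assumptions, not the paper's exact generator).
import numpy as np

def make_garnet(n_states=5, n_actions=3, branching=3, seed=0):
    """Random Garnet MDP: transition kernel P[s, a, s'] and reward table r[s, a]."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Each (s, a) pair reaches `branching` randomly chosen next states.
            succ = rng.choice(n_states, size=branching, replace=False)
            P[s, a, succ] = rng.dirichlet(np.ones(branching))
    r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return P, r

P, r = make_garnet()  # 5 states and 3 actions, matching the small-state experiment
```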
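
The exact Algorithm 1 setting quoted in the Experiment Setup row alternates Tp = 5 outer transition-kernel updates (stepsize β = 0.001) with T = 1 inner policy update (stepsize η = 50) per outer step. The sketch below only illustrates that nested loop under those stepsizes; `policy_gradient_step` and `kernel_gradient_step` are hypothetical placeholders for the exact gradient oracles, and the paper's actual update rules (regularization, projections onto the policy simplex and the s-rectangular uncertainty set) are not reproduced here.

```python
# Nested-loop sketch of the quoted hyperparameters for the exact Algorithm 1 setting.
# `policy_gradient_step` / `kernel_gradient_step` are hypothetical placeholders;
# regularization and projection steps from the paper are omitted.
T_P, BETA = 5, 1e-3    # outer transition-kernel updates and stepsize beta
T_IN, ETA = 1, 50.0    # inner policy updates per outer step and stepsize eta

def run_exact_arpg(pi, P, policy_gradient_step, kernel_gradient_step):
    """Alternate inner policy improvement with outer adversarial kernel updates."""
    for _ in range(T_P):
        for _ in range(T_IN):
            pi = policy_gradient_step(pi, P, ETA)   # policy step on the robust value
        P = kernel_gradient_step(pi, P, BETA)       # adversarial kernel step
    return pi, P
```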