Accelerated Policy Gradient for s-rectangular Robust MDPs with Large State Spaces

Authors: Ziyi Chen, Heng Huang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments are implemented in Python 3.9 on a MacBook Pro laptop with 500 GB storage and an 8-core CPU (16 GB memory). The code can be downloaded from https://github.com/changy12/ICML2024-Accelerated-Policy-Gradient-for-s-rectangular-Robust-MDPs-with-Large-State-Spaces. Section O.1: Experiments on Small State Space under Deterministic Setting; Section O.2: Experiments on Large State Space.
Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park. Correspondence to: Ziyi Chen <zc286@umd.edu>, Heng Huang <heng@umd.edu>.
Pseudocode | Yes | Algorithm 1: Accelerated Robust Policy Gradient; Algorithm 2: Accelerated Stochastic Robust Policy Gradient; Algorithm 3: Accelerated Stochastic Robust Policy Gradient for Large State Space.
Open Source Code | Yes | The code can be downloaded from https://github.com/changy12/ICML2024-Accelerated-Policy-Gradient-for-s-rectangular-Robust-MDPs-with-Large-State-Spaces.
Open Datasets | Yes | We compare our Algorithm 1 with the existing double-loop robust policy gradient (DRPG) algorithm (Wang et al., 2023) and the actor-critic algorithm (Li et al., 2023b) under the deterministic setting (i.e., when exact values of some quantities are available, including gradients, Q functions, V functions, etc.) on the Garnet problem (Archibald et al., 1995; Wang and Zou, 2022) with state space S = {0, 1, 2, 3, 4} of 5 states and action space A = {0, 1, 2} of 3 actions (a construction sketch follows this table).
Dataset Splits | No | The paper describes problem setups (Garnet problem with different state spaces) and experimental parameters but does not specify train/validation/test dataset splits for reproducibility.
Hardware Specification | Yes | The experiments are implemented in Python 3.9 on a MacBook Pro laptop with 500 GB storage and an 8-core CPU (16 GB memory).
Software Dependencies | No | The paper mentions "Python 3.9" but does not specify versions for other key software libraries or dependencies.
Experiment Setup | Yes | We implement an exact version of Algorithm 1 (i.e., ϵ1 = ϵ2 = 0) using Tp = 5 outer transition kernel updates with stepsize β = 0.001, and T = 1 inner policy update with stepsize η = (1 − γ)/τ = 50 per outer update (a loop-structure sketch follows this table). For the DRPG algorithm, we use T = 5 outer policy updates (Algorithm 1 of (Wang et al., 2023)) with stepsize αt = 10 and Tk = 1 inner transition kernel update (Algorithm 2 of (Wang et al., 2023)) with stepsize βt = 0.001 per outer update. For the actor-critic algorithm (Algorithm 4.1 of (Li et al., 2023b)), we use K = 5 outer iterations, where the actor step (policy update) uses stepsize η = 500, and the critic step (transition kernel update) uses only 1 iteration of Algorithm 3.2 of (Li et al., 2023b) with αm = 1 as well as Pϵ obtained by exactly solving the direction-finding subproblem in eq. (3.4) of (Li et al., 2023b). In the above robust MDP setting with varying d ∈ {5, 20, 50, 100, 130, 140, 150}, we implement Algorithm 3 with τ = 0.1, T = 10, T = 20, T1 = 10^5, η = 1, β = 0.001, α = 0.001, N = 1, H = 500, N = 10^4.
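
The Garnet problem referenced in the comparison above is a randomly generated tabular MDP. Below is a minimal sketch of constructing a Garnet instance with 5 states and 3 actions in Python; the branching factor, reward distribution, and random seed are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def make_garnet(n_states=5, n_actions=3, n_branch=3, seed=0):
    """Sketch of a Garnet(n_states, n_actions) MDP: each (s, a) pair reaches
    n_branch randomly chosen next states with random probabilities, and
    rewards are sampled uniformly. The branching factor and reward
    distribution are assumptions of this sketch."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))  # nominal kernel P(s' | s, a)
    for s in range(n_states):
        for a in range(n_actions):
            next_states = rng.choice(n_states, size=n_branch, replace=False)
            P[s, a, next_states] = rng.dirichlet(np.ones(n_branch))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))  # reward r(s, a)
    return P, R

P, R = make_garnet()  # S = {0, 1, 2, 3, 4}, A = {0, 1, 2}
```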
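
As a rough illustration of the double-loop structure that the exact version of Algorithm 1 follows in the setup above (Tp = 5 outer transition-kernel updates with stepsize β = 0.001, each followed by T = 1 inner policy update with stepsize η), here is a hedged skeleton. The routines `grad_kernel` and `policy_update` are hypothetical placeholders, not the paper's actual gradient oracles, and the projection onto the s-rectangular uncertainty set is omitted.

```python
# Hyperparameters quoted from the paper's deterministic small-state experiment.
T_P, T_INNER = 5, 1      # Tp outer transition-kernel updates, T inner policy updates
BETA, ETA = 0.001, 50.0  # kernel stepsize beta and policy stepsize eta

def accelerated_robust_pg_exact(P, R, pi, grad_kernel, policy_update):
    """Skeleton of the exact (epsilon_1 = epsilon_2 = 0) double-loop structure:
    the adversarial transition kernel P is updated Tp times, and each kernel
    update is followed by T inner policy updates. grad_kernel and
    policy_update stand in for the paper's exact computations; the update
    direction and the uncertainty-set projection are assumptions here."""
    for _ in range(T_P):
        # Adversary step: move the kernel against the current policy
        # (projection onto the s-rectangular uncertainty set omitted).
        P = P - BETA * grad_kernel(P, R, pi)
        for _ in range(T_INNER):
            # Agent step: entropy-regularized policy update under the new kernel.
            pi = policy_update(P, R, pi, ETA)
    return P, pi
```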