A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

Authors: Zhenwei Lin, Chenyu Xue, Qi Deng, Yinyu Ye

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Numerical experiments validate the efficacy of SRPG, demonstrating its faster and more robust convergence behavior compared to its nested-loop counterpart." (Abstract) and "We conduct several experiments to investigate the performance of SRPG compared with DRPG (Wang et al., 2023). In particular, we consider two different problems, including GARNET MDPs and an inventory management problem." (Section 5)
Researcher Affiliation | Academia | 1. Shanghai University of Finance and Economics; 2. Antai College of Economics and Management, Shanghai Jiao Tong University; 3. Stanford University. Correspondence to: Qi Deng <qdeng24@sjtu.edu.cn>.
Pseudocode | Yes | Algorithm 1 "Single-loop Robust Policy Gradient Method" (see the update sketch after this table).
Open Source Code | Yes | "We provide the code in this link."
Open Datasets | No | The paper's GARNET MDPs are randomly generated, and neither they nor the inventory management problem come with access information (link, DOI, or citation for a public instance). For example, "We randomly generate the nominal transition kernel p according to two different GARNET MDPs: GARNET(5, 6, 3) and GARNET(10, 5, 10)." (Section 5.1). A GARNET generation sketch appears after this table.
Dataset Splits | No | The paper does not report dataset splits (e.g., percentages or sample counts for training, validation, and test sets) for its experiments.
Hardware Specification | No | The paper does not report hardware details (e.g., GPU model, CPU model, or memory) for its experiments; it mentions only the use of "the state-of-the-art commercial solver GUROBI" (Section 5).
Software Dependencies | No | The paper cites "the state-of-the-art commercial solver GUROBI (Gurobi Optimization, LLC, 2023)" but does not specify a version number for GUROBI or any other software dependency.
Experiment Setup | Yes | "We let the discount factor γ = 0.95, and sample the cost c_{s,a} i.i.d. from the uniform distribution supported on [0, 5]. ... We choose the primal stepsize τ and dual stepsize σ from {0.01, 0.05, 0.1}. We also choose the extrapolation parameters β and µ from {0.01, 0.05, 0.1, 0.2, 0.4} for SRPG. For DRPG, we also tune its primal and dual stepsizes from {0.01, 0.05, 0.1}." (Section 5.1). A sketch enumerating this grid appears after this table.
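
The paper's Algorithm 1 is not reproduced here. As a reading aid, the following is a minimal, hypothetical sketch of what a single-loop primal-dual robust policy gradient update can look like, wired to the hyperparameter names from Section 5.1 (primal stepsize τ, dual stepsize σ, extrapolation parameters β and µ). The gradient oracles, the projections, and the exact placement of the extrapolation steps are assumptions, not the paper's algorithm.

```python
import numpy as np

def srpg_like_update(grad_pi, grad_p, proj_pi, proj_p, pi0, p0,
                     tau=0.05, sigma=0.05, beta=0.1, mu=0.1, iters=1000):
    """Hypothetical single-loop primal-dual iteration (not the paper's
    Algorithm 1): one projected policy (primal) step and one projected
    kernel (dual) step per iteration, with simple extrapolation."""
    pi, p = np.array(pi0, dtype=float), np.array(p0, dtype=float)
    pi_prev, p_prev = pi.copy(), p.copy()
    for _ in range(iters):
        # Extrapolated iterates; beta and mu mimic the roles of the
        # extrapolation parameters tuned in Section 5.1.
        pi_bar = pi + beta * (pi - pi_prev)
        p_bar = p + mu * (p - p_prev)
        pi_prev, p_prev = pi.copy(), p.copy()
        # Simultaneous descent on the policy and ascent on the kernel:
        # one step of each per iteration, hence "single loop".
        pi = proj_pi(pi - tau * grad_pi(pi_bar, p_bar))
        p = proj_p(p + sigma * grad_p(pi_bar, p_bar))
    return pi, p
```

The point of the sketch is the single-loop structure the abstract contrasts with a nested-loop counterpart: rather than solving the inner maximization over the uncertainty set to completion at every iteration, the adversarial kernel takes a single ascent step alongside each policy step.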
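The GARNET(n_s, n_a, b) instances quoted in the Open Datasets row are randomly generated rather than downloaded. For reference, the sketch below follows the common GARNET recipe (transition mass spread over b randomly chosen successor states per state-action pair); the paper's exact construction may differ, and garnet_kernel is a name introduced here for illustration.

```python
import numpy as np

def garnet_kernel(n_states, n_actions, branching, seed=0):
    """Generate a GARNET-style nominal transition kernel: for each
    (state, action) pair, probability mass is split among `branching`
    randomly chosen successor states."""
    rng = np.random.default_rng(seed)
    p = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=branching, replace=False)
            # Random cut points on [0, 1] induce the branch probabilities.
            cuts = np.sort(rng.uniform(size=branching - 1))
            probs = np.diff(np.concatenate(([0.0], cuts, [1.0])))
            p[s, a, succ] = probs
    return p
```

Under these assumptions, garnet_kernel(5, 6, 3) and garnet_kernel(10, 5, 10) would mirror the two instances quoted from Section 5.1.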
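Finally, the quoted experiment setup fixes everything except which grid point performs best. A minimal enumeration of that hyperparameter sweep, assuming the GARNET(5, 6, 3) dimensions and leaving the actual SRPG run as a stub, could look like:

```python
import itertools
import numpy as np

# Setup quoted in Section 5.1: gamma = 0.95, costs c_{s,a} ~ Uniform[0, 5].
n_states, n_actions = 5, 6  # GARNET(5, 6, 3) dimensions, assumed here
gamma = 0.95
rng = np.random.default_rng(0)
cost = rng.uniform(0.0, 5.0, size=(n_states, n_actions))

# Hyperparameter grids quoted for SRPG; DRPG reuses the stepsize grids.
taus = [0.01, 0.05, 0.1]               # primal stepsize tau
sigmas = [0.01, 0.05, 0.1]             # dual stepsize sigma
extras = [0.01, 0.05, 0.1, 0.2, 0.4]   # extrapolation parameters beta, mu

grid = list(itertools.product(taus, sigmas, extras, extras))
print(f"{len(grid)} SRPG configurations")  # 3 * 3 * 5 * 5 = 225
for tau, sigma, beta, mu in grid:
    pass  # run SRPG with (tau, sigma, beta, mu) and record convergence
```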