A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes
Authors: Zhenwei Lin, Chenyu Xue, Qi Deng, Yinyu Ye
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments validate the efficacy of SRPG, demonstrating its faster and more robust convergence behavior compared to its nested-loop counterpart. (Abstract) and We conduct several experiments to investigate the performance of SRPG compared with DRPG (Wang et al., 2023). In particular, we consider two different problems, including GARNET MDPs and an inventory management problem. (Section 5). |
| Researcher Affiliation | Academia | ¹Shanghai University of Finance and Economics; ²Antai College of Economics and Management, Shanghai Jiao Tong University; ³Stanford University. Correspondence to: Qi Deng <qdeng24@sjtu.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Single-loop Robust Policy Gradient Method |
| Open Source Code | Yes | We provide the code in this link. |
| Open Datasets | No | The paper mentions using GARNET MDPs, which are generated, and an inventory management problem, but does not provide specific access information (link, DOI, citation for a public instance) for the datasets used in their experiments. For example, 'We randomly generate the nominal transition kernel p according to two different GARNET MDPs: GARNET(5, 6, 3) and GARNET(10, 5, 10).' (Section 5.1). |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or sample counts for training, validation, and test sets) for the experiments conducted. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, or memory) used for running its experiments. It only mentions the use of 'the state-of-the-art commercial solver GUROBI' (Section 5). |
| Software Dependencies | No | The paper mentions using 'the state-of-the-art commercial solver GUROBI (Gurobi Optimization, LLC, 2023)', but it does not specify a version number for GUROBI or any other software dependencies. |
| Experiment Setup | Yes | We let the discount factor γ = 0.95, and sample the cost $c_{sas'}$ i.i.d. from the uniform distribution supported on [0, 5]. ... We choose the primal stepsize τ and dual stepsize σ from {0.01, 0.05, 0.1}. We also choose the extrapolation parameters β and µ from {0.01, 0.05, 0.1, 0.2, 0.4} for SRPG. For DRPG, we also tune its primal and dual stepsize from {0.01, 0.05, 0.1}. (Section 5.1). |
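
The setup quoted in the Experiment Setup row can be approximated with a short sketch. This is not the authors' code: it assumes the usual GARNET convention in which the three arguments are (number of states, number of actions, branching factor), draws the nonzero transition probabilities from a flat Dirichlet distribution, and treats every combination of the quoted values as a candidate configuration. The names `make_garnet`, `srpg_grid`, and `drpg_grid` are hypothetical.

```python
import itertools
import numpy as np

def make_garnet(n_states, n_actions, branching, rng):
    """Random nominal kernel P[s, a, s'] with `branching` reachable next states per (s, a).

    Assumption: GARNET(n_states, n_actions, branching); the paper's exact
    sampling scheme for the nonzero probabilities is not stated in the table.
    """
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Pick which next states are reachable, then sample their probabilities.
            support = rng.choice(n_states, size=branching, replace=False)
            P[s, a, support] = rng.dirichlet(np.ones(branching))
    return P

rng = np.random.default_rng(0)
gamma = 0.95                                  # discount factor from Section 5.1
P = make_garnet(5, 6, 3, rng)                 # GARNET(5, 6, 3)
cost = rng.uniform(0.0, 5.0, size=P.shape)    # c_{sas'} i.i.d. uniform on [0, 5]

# Candidate values quoted in the table; a full grid search is an assumption.
step_sizes = [0.01, 0.05, 0.1]                # primal tau and dual sigma
extrapolation = [0.01, 0.05, 0.1, 0.2, 0.4]   # beta and mu (SRPG only)
srpg_grid = list(itertools.product(step_sizes, step_sizes, extrapolation, extrapolation))
drpg_grid = list(itertools.product(step_sizes, step_sizes))
```

Calling `make_garnet(10, 5, 10, rng)` gives the second instance, GARNET(10, 5, 10), under the same assumptions. Fixing the seed of the generator is one way to make such generated instances repeatable, which matters here because no public instance files are referenced.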