Policy Optimization for Robust Average Reward MDPs
Authors: Zhongchang Sun, Sihong He, Fei Miao, Shaofeng Zou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide some simulation results to demonstrate the performance of our algorithm. |
| Researcher Affiliation | Academia | Zhongchang Sun University at Buffalo zhongcha@buffalo.edu Sihong He University of Texas at Arlington sihong.he@uta.edu Fei Miao University of Connecticut fei.miao@uconn.edu Shaofeng Zou Arizona State University zou@asu.edu |
| Pseudocode | Yes | Algorithm 1 Robust Policy Mirror Descent |
| Open Source Code | No | The code will be released if the paper is accepted. |
| Open Datasets | Yes | We verify our method on one classical problem: the Garnet problem, and a robotic application problem: the recycling robot problem. More details can be found in [2]. For more details, refer to [34]. |
| Dataset Splits | No | The paper mentions training episodes and steps but does not specify explicit train/validation/test dataset splits. |
| Hardware Specification | Yes | The host machine used in our experiments is a server configured with AMD Ryzen Threadripper 2990WX 32-core processors and four Quadro RTX 6000 GPUs. |
| Software Dependencies | No | All experiments are performed on Python 3.8. |
| Experiment Setup | Yes | We consider the constant step size and set the step size η = 0.01, the pre-specified radius of the uncertainty set R = 0.1. Each training episode contains 2000 training steps. The length of training episodes is respectively 100 and 300 for Garnet and robot problems. We choose the uncertainty set to be the KL divergence uncertainty set. Both methods use a uniform random policy as the initialized policy. |
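
Since the code was not released at the time of review, the snippet below is a minimal sketch, not the authors' Algorithm 1, of a tabular policy mirror descent step with a KL Bregman term. It reuses the reported constant step size η = 0.01 and the uniform random initial policy; the robust Q-values passed in are a placeholder, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def mirror_descent_step(policy, q_values, eta=0.01):
    """One tabular policy mirror descent update with a KL Bregman term.

    policy:   (num_states, num_actions) array, rows sum to 1.
    q_values: (num_states, num_actions) array of (robust) action values;
              here a placeholder, not the paper's robust evaluation.
    eta:      constant step size; the reported setup uses eta = 0.01.
    """
    # Exponentiated-gradient form of the KL mirror descent update:
    # pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q_t(s, a))
    logits = np.log(policy + 1e-12) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy

# Uniform random initial policy, as reported for both compared methods.
num_states, num_actions = 5, 3
policy = np.full((num_states, num_actions), 1.0 / num_actions)
q_values = np.random.randn(num_states, num_actions)  # placeholder values
policy = mirror_descent_step(policy, q_values, eta=0.01)
```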
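The Garnet benchmark cited in the Open Datasets row is a family of randomly generated MDPs rather than a fixed dataset. The sketch below constructs one such instance; the state/action counts and branching factor are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def make_garnet(num_states, num_actions, branching, seed=0):
    """Generate a Garnet-style random MDP.

    Each (state, action) pair transitions to `branching` randomly chosen
    next states with random probabilities; rewards are drawn uniformly.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((num_states, num_actions, num_states))
    R = rng.uniform(size=(num_states, num_actions))
    for s in range(num_states):
        for a in range(num_actions):
            next_states = rng.choice(num_states, size=branching, replace=False)
            P[s, a, next_states] = rng.dirichlet(np.ones(branching))
    return P, R

# Example: a small Garnet instance (sizes here are illustrative).
P, R = make_garnet(num_states=10, num_actions=4, branching=3)
```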