Policy Optimization for Robust Average Reward MDPs

Authors: Zhongchang Sun, Sihong He, Fei Miao, Shaofeng Zou

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In this section, we provide some simulation results to demonstrate the performance of our algorithm. |
| Researcher Affiliation | Academia | Zhongchang Sun, University at Buffalo (zhongcha@buffalo.edu); Sihong He, University of Texas at Arlington (sihong.he@uta.edu); Fei Miao, University of Connecticut (fei.miao@uconn.edu); Shaofeng Zou, Arizona State University (zou@asu.edu) |
| Pseudocode | Yes | Algorithm 1: Robust Policy Mirror Descent |
| Open Source Code | No | The code will be released if the paper is accepted. |
| Open Datasets | Yes | We verify our method on one classical problem: the Garnet problem, and a robotic application problem: the recycling robot problem. More details can be found in [2]. For more details, refer to [34]. |
| Dataset Splits | No | The paper mentions training episodes and steps but does not specify explicit train/validation/test dataset splits. |
| Hardware Specification | Yes | The host machine used in our experiments is a server configured with AMD Ryzen Threadripper 2990WX 32-core processors and four Quadro RTX 6000 GPUs. |
| Software Dependencies | No | All experiments are performed on Python 3.8. |
| Experiment Setup | Yes | We consider the constant step size and set the step size η = 0.01, the pre-specified radius of the uncertainty set R = 0.1. Each training episode contains 2000 training steps. The length of training episodes is respectively 100 and 300 for Garnet and robot problems. We choose the uncertainty set to be the KL divergence uncertainty set. Both methods use a uniform random policy as the initialized policy. |