Average-Reward Reinforcement Learning with Trust Region Methods

Authors: Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach."
Researcher Affiliation | Academia | Xiaoteng Ma (1), Xiaohang Tang (2), Li Xia (3), Jun Yang (1), and Qianchuan Zhao (1). (1) Department of Automation, Tsinghua University; (2) Department of Statistical Science, University College London; (3) Business School, Sun Yat-sen University
Pseudocode | Yes | Algorithm 1: Average Policy Optimization (see the sketch after this table)
Open Source Code | No | The paper does not explicitly state that its implementation code is open source, nor does it provide a link to a repository for its method.
Open Datasets | Yes | "We choose the continuous control benchmark MuJoCo [Todorov et al., 2012] with the OpenAI Gym [Brockman et al., 2016]." (see the environment sketch below)
Dataset Splits | Yes | "For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. In the evaluation, we run 10 episodes without exploration by setting the standard deviation of the policy to zero." (see the evaluation sketch below)
Hardware Specification | Yes | "The computing infrastructure for running experiments is a server with 2 AMD EPYC 7702 64-Core Processor CPUs and 8 Nvidia GeForce RTX 2080 Ti GPUs."
Software Dependencies | No | The paper mentions using a PyTorch framework named rlpyt but does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | "All the hyperparameter combinations we consider are grid searched, which are shown in Appendix B. For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps." (see the grid-search sketch below)
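
The Pseudocode row refers to Algorithm 1 (Average Policy Optimization) in the paper, which is not reproduced in this summary. The following is a minimal, hypothetical sketch of what an average-reward, PPO-style trust region update could look like, assuming a clipped surrogate objective and one-step differential advantages built around a running average-reward estimate `rho`. All names (`collect_rollout`, `policy.log_prob`, `value_fn`) are placeholders, not the authors' code, and the sketch may differ from the paper's actual Algorithm 1.

```python
import torch

def apo_style_update(policy, value_fn, optimizer, collect_rollout,
                     num_iterations=1000, clip_eps=0.2, rho_lr=0.05):
    """Hypothetical average-reward PPO-style loop; NOT the paper's exact Algorithm 1.

    Assumed rollout convention for a trajectory of length T:
      obs:           (T + 1, obs_dim)   states, including the final state
      actions:       (T, act_dim)
      rewards:       (T,)
      old_log_probs: (T,)               log-probs under the behavior policy
    """
    rho = 0.0  # running estimate of the average reward per step
    for _ in range(num_iterations):
        obs, actions, rewards, old_log_probs = collect_rollout(policy)

        # Update the average-reward estimate from the fresh rollout.
        rho = (1.0 - rho_lr) * rho + rho_lr * rewards.mean().item()

        # One-step differential advantage: A(s, a) ~ r - rho + V(s') - V(s),
        # in place of the gamma-discounted advantage used in standard PPO.
        values = value_fn(obs).squeeze(-1)  # (T + 1,)
        advantages = rewards - rho + values[1:].detach() - values[:-1].detach()

        # PPO-style clipped surrogate on the average-reward advantage.
        new_log_probs = policy.log_prob(obs[:-1], actions)
        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

        # Critic regression toward one-step differential targets.
        value_targets = rewards - rho + values[1:].detach()
        value_loss = ((values[:-1] - value_targets) ** 2).mean()

        optimizer.zero_grad()
        (policy_loss + value_loss).backward()
        optimizer.step()
    return rho
```

The key design difference from discounted PPO assumed in this sketch is that the critic learns a differential (bias) value function, so advantages subtract the estimated average reward `rho` rather than discounting future rewards.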
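The Open Datasets row points to the MuJoCo benchmark accessed through OpenAI Gym. As a rough illustration only (the paper's exact task list and Gym/MuJoCo versions are not stated in this summary), the environments would be created along these lines:

```python
import gym

# Standard Gym MuJoCo task IDs; assumed here, not taken from the paper.
TASKS = ["HalfCheetah-v2", "Hopper-v2", "Walker2d-v2", "Ant-v2",
         "Swimmer-v2", "Humanoid-v2"]

def make_env(task_id, seed):
    """Create and seed one benchmark environment (classic Gym API)."""
    env = gym.make(task_id)
    env.seed(seed)  # newer Gym versions seed via env.reset(seed=seed) instead
    return env
```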
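The Dataset Splits row describes the evaluation protocol: 5 random seeds per task, 3 million environment steps, an evaluation every 2000 steps, and 10 deterministic episodes per evaluation. A minimal sketch of such a deterministic evaluation loop, assuming the classic Gym reset/step API and a placeholder `policy.mean_action` helper:

```python
import numpy as np

def evaluate(policy, env, num_episodes=10):
    """Average return over deterministic evaluation episodes.

    Mirrors the quoted protocol: 10 episodes with exploration switched off,
    i.e. the Gaussian policy's standard deviation set to zero so the mean
    action is executed. `policy.mean_action` is a placeholder helper.
    """
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = policy.mean_action(obs)  # std = 0 -> deterministic action
            obs, reward, done, _ = env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))
```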
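The Experiment Setup row states that all hyperparameter combinations were grid searched; the grid itself lives in Appendix B of the paper and is not reproduced here. A generic sketch of such a sweep, using hypothetical hyperparameter names and values and a placeholder `train_fn` entry point:

```python
from itertools import product

# Hypothetical grid; the real names and values are those of Appendix B.
GRID = {
    "learning_rate": [1e-4, 3e-4],
    "clip_eps": [0.1, 0.2],
    "rollout_length": [2000, 4000],
}

def run_grid_search(train_fn, num_seeds=5, total_steps=3_000_000, eval_every=2_000):
    """Run every hyperparameter combination with `num_seeds` random seeds,
    following the quoted protocol (3M steps per run, evaluation every
    2000 steps). `train_fn` is a placeholder training entry point."""
    results = {}
    keys = sorted(GRID)
    for values in product(*(GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        for seed in range(num_seeds):
            results[(values, seed)] = train_fn(config, seed=seed,
                                               total_steps=total_steps,
                                               eval_every=eval_every)
    return results
```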