Average-Reward Reinforcement Learning with Trust Region Methods
Authors: Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach. |
| Researcher Affiliation | Academia | Xiaoteng Ma¹, Xiaohang Tang², Li Xia³, Jun Yang¹ and Qianchuan Zhao¹ — ¹Department of Automation, Tsinghua University; ²Department of Statistical Science, University College London; ³Business School, Sun Yat-sen University |
| Pseudocode | Yes | Algorithm 1 Average Policy Optimization |
| Open Source Code | No | The paper does not explicitly state that its implementation code is open source or provide a link to a repository for its specific methodology. |
| Open Datasets | Yes | We choose the continuous control benchmark MuJoCo [Todorov et al., 2012] with the OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | Yes | For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. In the evaluation, we run 10 episodes without exploration by setting the standard deviation of policy as zero. |
| Hardware Specification | Yes | The computing infrastructure for running experiments is a server with 2 AMD EPYC 7702 64-Core Processor CPUs and 8 Nvidia GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions using rlpyt, a PyTorch-based framework, but does not specify version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | All the hyperparameter combinations we consider are grid searched, which are shown in Appendix B. For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. |
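The Pseudocode row above reports Algorithm 1 (Average Policy Optimization). The sketch below is a generic, hedged illustration of the average-reward idea behind such a method: the advantage uses an estimated average reward rho in place of a discount factor, combined with a PPO-style clipped surrogate. It is not the paper's exact Algorithm 1; the policy API (`policy.log_prob`), the constants (`clip_eps`, `rho_lr`), and the loss weighting are assumptions.

```python
# Hedged sketch of a generic average-reward actor-critic update (PyTorch).
# NOT the paper's Algorithm 1; names and structure are illustrative assumptions.
import torch

def average_reward_advantages(rewards, values, next_values, rho):
    # One-step average-reward TD errors: delta = r - rho + V(s') - V(s).
    return rewards - rho + next_values - values

def apo_style_update(policy, value_fn, optimizer, batch, rho,
                     clip_eps=0.2, rho_lr=0.01):
    obs, actions, rewards, next_obs, old_log_probs = batch

    values = value_fn(obs).squeeze(-1)
    next_values = value_fn(next_obs).squeeze(-1).detach()
    adv = average_reward_advantages(rewards, values.detach(), next_values, rho)

    # PPO-style clipped surrogate on the probability ratio.
    log_probs = policy.log_prob(obs, actions)  # hypothetical policy API
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Critic regresses toward the average-reward TD target r - rho + V(s').
    value_loss = ((rewards - rho + next_values - values) ** 2).mean()

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()

    # Running estimate of the average reward.
    new_rho = rho + rho_lr * (rewards.mean().item() - rho)
    return new_rho
```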
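The Dataset Splits and Experiment Setup rows above describe the evaluation protocol: 5 random seeds, 3 million environment steps, evaluation every 2000 steps, and 10 deterministic episodes per evaluation (policy standard deviation set to zero). Below is a minimal sketch of that schedule; the agent and policy interfaces (`collect_and_train`, `mean_action`) are hypothetical, and the old Gym step/reset API is assumed.

```python
# Hedged sketch of the reported evaluation schedule; interfaces are assumptions.
import numpy as np

def evaluate(env, policy, episodes=10):
    """Run deterministic evaluation episodes (policy std set to zero)."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy.mean_action(obs)  # hypothetical deterministic-action call
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

def run_experiment(make_env, make_agent, seeds=range(5),
                   total_steps=3_000_000, eval_interval=2000):
    """5 seeds, 3M steps per seed, evaluation every 2000 steps."""
    curves = {}
    for seed in seeds:
        env, eval_env = make_env(seed), make_env(seed + 1000)
        agent = make_agent(seed)
        scores = []
        for step in range(1, total_steps + 1):
            agent.collect_and_train(env)  # hypothetical one-step collect/update
            if step % eval_interval == 0:
                scores.append(evaluate(eval_env, agent.policy, episodes=10))
        curves[seed] = scores
    return curves
```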