Average-Reward Reinforcement Learning with Trust Region Methods

Authors: Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach."
Researcher Affiliation | Academia | Xiaoteng Ma (1), Xiaohang Tang (2), Li Xia (3), Jun Yang (1), and Qianchuan Zhao (1). (1) Department of Automation, Tsinghua University; (2) Department of Statistical Science, University College London; (3) Business School, Sun Yat-sen University
Pseudocode | Yes | Algorithm 1: Average Policy Optimization (see the sketch after this table)
Open Source Code | No | The paper does not explicitly state that its implementation code is open source, nor does it provide a link to a repository for its method.
Open Datasets | Yes | "We choose the continuous control benchmark MuJoCo [Todorov et al., 2012] with the OpenAI Gym [Brockman et al., 2016]." (see the environment sketch below)
Dataset Splits | Yes | "For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. In the evaluation, we run 10 episodes without exploration by setting the standard deviation of the policy to zero." (see the evaluation sketch below)
Hardware Specification | Yes | "The computing infrastructure for running experiments is a server with 2 AMD EPYC 7702 64-Core Processor CPUs and 8 Nvidia GeForce RTX 2080 Ti GPUs."
Software Dependencies | No | The paper mentions using a PyTorch framework named rlpyt but does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | "All the hyperparameter combinations we consider are grid searched, which are shown in Appendix B. For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps." (see the grid-search sketch below)
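
The Pseudocode row refers to Algorithm 1 (Average Policy Optimization) in the paper, which is not reproduced in this summary. The following is a minimal, hypothetical sketch of what an average-reward, PPO-style trust region update could look like, assuming a clipped surrogate objective and one-step differential advantages built around a running average-reward estimate `rho`. All names (`collect_rollout`, `policy.log_prob`, `value_fn`) are placeholders, not the authors' code, and the sketch may differ from the paper's actual Algorithm 1.

```python
import torch

def apo_style_update(policy, value_fn, optimizer, collect_rollout,
                     num_iterations=1000, clip_eps=0.2, rho_lr=0.05):
    """Hypothetical average-reward PPO-style loop; NOT the paper's exact Algorithm 1.

    Assumed rollout convention for a trajectory of length T:
      obs:           (T + 1, obs_dim)   states, including the final state
      actions:       (T, act_dim)
      rewards:       (T,)
      old_log_probs: (T,)               log-probs under the behavior policy
    """
    rho = 0.0  # running estimate of the average reward per step
    for _ in range(num_iterations):
        obs, actions, rewards, old_log_probs = collect_rollout(policy)

        # Update the average-reward estimate from the fresh rollout.
        rho = (1.0 - rho_lr) * rho + rho_lr * rewards.mean().item()

        # One-step differential advantage: A(s, a) ~ r - rho + V(s') - V(s),
        # in place of the gamma-discounted advantage used in standard PPO.
        values = value_fn(obs).squeeze(-1)  # (T + 1,)
        advantages = rewards - rho + values[1:].detach() - values[:-1].detach()

        # PPO-style clipped surrogate on the average-reward advantage.
        new_log_probs = policy.log_prob(obs[:-1], actions)
        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

        # Critic regression toward one-step differential targets.
        value_targets = rewards - rho + values[1:].detach()
        value_loss = ((values[:-1] - value_targets) ** 2).mean()

        optimizer.zero_grad()
        (policy_loss + value_loss).backward()
        optimizer.step()
    return rho
```

The key design difference from discounted PPO assumed in this sketch is that the critic learns a differential (bias) value function, so advantages subtract the estimated average reward `rho` rather than discounting future rewards.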
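The Open Datasets row points to the MuJoCo benchmark accessed through OpenAI Gym. As a rough illustration only (the paper's exact task list and Gym/MuJoCo versions are not stated in this summary), the environments would be created along these lines:

```python
import gym

# Standard Gym MuJoCo task IDs; assumed here, not taken from the paper.
TASKS = ["HalfCheetah-v2", "Hopper-v2", "Walker2d-v2", "Ant-v2",
         "Swimmer-v2", "Humanoid-v2"]

def make_env(task_id, seed):
    """Create and seed one benchmark environment (classic Gym API)."""
    env = gym.make(task_id)
    env.seed(seed)  # newer Gym versions seed via env.reset(seed=seed) instead
    return env
```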
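The Dataset Splits row describes the evaluation protocol: 5 random seeds per task, 3 million environment steps, an evaluation every 2000 steps, and 10 deterministic episodes per evaluation. A minimal sketch of such a deterministic evaluation loop, assuming the classic Gym reset/step API and a placeholder `policy.mean_action` helper:

```python
import numpy as np

def evaluate(policy, env, num_episodes=10):
    """Average return over deterministic evaluation episodes.

    Mirrors the quoted protocol: 10 episodes with exploration switched off,
    i.e. the Gaussian policy's standard deviation set to zero so the mean
    action is executed. `policy.mean_action` is a placeholder helper.
    """
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = policy.mean_action(obs)  # std = 0 -> deterministic action
            obs, reward, done, _ = env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))
```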
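The Experiment Setup row states that all hyperparameter combinations were grid searched; the grid itself lives in Appendix B of the paper and is not reproduced here. A generic sketch of such a sweep, using hypothetical hyperparameter names and values and a placeholder `train_fn` entry point:

```python
from itertools import product

# Hypothetical grid; the real names and values are those of Appendix B.
GRID = {
    "learning_rate": [1e-4, 3e-4],
    "clip_eps": [0.1, 0.2],
    "rollout_length": [2000, 4000],
}

def run_grid_search(train_fn, num_seeds=5, total_steps=3_000_000, eval_every=2_000):
    """Run every hyperparameter combination with `num_seeds` random seeds,
    following the quoted protocol (3M steps per run, evaluation every
    2000 steps). `train_fn` is a placeholder training entry point."""
    results = {}
    keys = sorted(GRID)
    for values in product(*(GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        for seed in range(num_seeds):
            results[(values, seed)] = train_fn(config, seed=seed,
                                               total_steps=total_steps,
                                               eval_every=eval_every)
    return results
```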