Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Average-Reward Reinforcement Learning with Trust Region Methods
Authors: Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao
IJCAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach. |
| Researcher Affiliation | Academia | Xiaoteng Ma¹, Xiaohang Tang², Li Xia³, Jun Yang¹ and Qianchuan Zhao¹; ¹Department of Automation, Tsinghua University; ²Department of Statistical Science, University College London; ³Business School, Sun Yat-sen University |
| Pseudocode | Yes | Algorithm 1 Average Policy Optimization |
| Open Source Code | No | The paper does not explicitly state that its implementation code is open source or provide a link to a repository for its specific methodology. |
| Open Datasets | Yes | We choose the continuous control benchmark MuJoCo [Todorov et al., 2012] with the OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | Yes | For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. In the evaluation, we run 10 episodes without exploration by setting the standard deviation of policy as zero. |
| Hardware Specification | Yes | The computing infrastructure for running experiments is a server with 2 AMD EPYC 7702 64-Core Processor CPUs and 8 Nvidia GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions using 'PyTorch named rlpyt' but does not specify version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | All the hyperparameter combinations we consider are grid searched, which are showed in Appendix B. For each task, we run the algorithm with 5 random seeds for 3 million steps and do the evaluation every 2000 steps. |
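
The evaluation protocol quoted in the table (5 random seeds, 3 million steps, an evaluation every 2000 steps, 10 episodes per evaluation) implies a fixed evaluation budget per task. The sketch below only works out that arithmetic; it is not the authors' code. The class name `EvalProtocol` and its methods are hypothetical, and it assumes evaluations fall on every multiple of 2000 steps from step 2000 through step 3,000,000, which the quoted text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical container for the numbers quoted in the table."""

    num_seeds: int = 5            # random seeds per task
    total_steps: int = 3_000_000  # training steps per seed
    eval_interval: int = 2_000    # steps between evaluations
    episodes_per_eval: int = 10   # deterministic episodes per evaluation

    def eval_steps(self) -> range:
        # Assumption: first evaluation at step `eval_interval`,
        # last one at `total_steps` (inclusive).
        return range(self.eval_interval, self.total_steps + 1, self.eval_interval)

    def evals_per_seed(self) -> int:
        return len(self.eval_steps())

    def total_eval_episodes(self) -> int:
        return self.num_seeds * self.evals_per_seed() * self.episodes_per_eval


protocol = EvalProtocol()
print(protocol.evals_per_seed())
print(protocol.total_eval_episodes())
```

Under these assumptions each seed yields 1500 evaluation points, so a learning curve for one task aggregates 5 seeds at each of those points.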