Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes
Authors: Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves O(T^{2/3}) regret after T steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to O(√T), albeit with a stronger ergodic assumption. This result significantly improves over the O(T^{3/4}) regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting. ... We also conduct experiments comparing our two algorithms. Details are deferred to Appendix D due to space constraints. |
| Researcher Affiliation | Academia | Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain (University of Southern California). Correspondence to: Chen-Yu Wei <chenyu.wei@usc.edu>, Mehdi Jafarnia-Jahromi <mjafarni@usc.edu>. |
| Pseudocode | Yes | Algorithm 1 OPTIMISTIC Q-LEARNING; Algorithm 2 MDP-OOMD; Algorithm 3 ESTIMATEQ; Algorithm 4 OOMDUPDATE. A hedged sketch of Algorithm 1 appears after this table. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the source code of its described methodology. While Appendix D is mentioned for experimental details, it does not state that code is provided there or elsewhere. |
| Open Datasets | No | The paper mentions conducting experiments, stating 'Details are deferred to Appendix D due to space constraints.' However, the main body of the paper does not specify any datasets used or provide concrete access information for them. |
| Dataset Splits | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific details on training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific experimental setup details such as hyperparameter values, training configurations, or system-level settings. |
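
Although the table above records no released source code, the paper's Algorithm 1 (OPTIMISTIC Q-LEARNING) is compact enough to sketch from the pseudocode it lists: run optimistic Q-learning on a discounted surrogate of the average-reward problem, with the discount factor, step sizes, and exploration bonuses tied to an effective horizon. The fragment below is a minimal illustrative sketch, not the authors' implementation: the tabular environment interface (`env.reset()` returning a state index, `env.step(action)` returning `(next_state, reward)` with rewards in [0, 1]), the horizon choice `H`, the bonus constant `c`, and the optimistic initialization are assumptions made here for demonstration; the paper derives the exact schedules as part of its regret analysis.

```python
import numpy as np

def optimistic_q_learning(env, num_states, num_actions, T, c=2.0):
    """Illustrative sketch of optimistic Q-learning for average-reward MDPs
    via a discounted surrogate (after Wei et al., 2020, Algorithm 1).
    `env` is assumed to expose reset() -> state and step(action) ->
    (next_state, reward); `c` is an assumed bonus constant."""
    # Effective horizon; the paper tunes gamma = 1 - 1/H to balance the
    # approximation bias against the learning error (this choice is assumed).
    H = max(2, int(np.ceil(T ** (1.0 / 3.0))))
    gamma = 1.0 - 1.0 / H

    # Optimistic initialization (initial value assumed for illustration).
    Q = np.full((num_states, num_actions), float(H))       # running estimates
    Q_hat = Q.copy()                                        # monotone optimistic bounds
    V_hat = np.full(num_states, float(H))
    counts = np.zeros((num_states, num_actions), dtype=int)

    state = env.reset()
    total_reward = 0.0
    for _ in range(T):
        action = int(np.argmax(Q_hat[state]))               # act greedily w.r.t. optimism
        next_state, reward = env.step(action)
        total_reward += reward

        counts[state, action] += 1
        tau = counts[state, action]
        alpha = (H + 1) / (H + tau)                          # step-size schedule
        bonus = c * np.sqrt(H / tau)                         # optimism bonus (constant assumed)

        target = reward + gamma * V_hat[next_state] + bonus
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
        Q_hat[state, action] = min(Q_hat[state, action], Q[state, action])
        V_hat[state] = Q_hat[state].max()

        state = next_state

    return total_reward / T                                  # empirical average reward
```

A loop of this shape would be a natural starting point for re-running the comparison the paper defers to Appendix D; Algorithm 2 (MDP-OOMD) replaces the Q-update with the ESTIMATEQ and OOMDUPDATE subroutines and additionally requires the paper's ergodicity assumption.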