Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Authors: Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves O(T^{2/3}) regret after T steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to O(√T), albeit with a stronger ergodic assumption. This result significantly improves over the O(T^{3/4}) regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting. ... We also conduct experiments comparing our two algorithms. Details are deferred to Appendix D due to space constraints.
Researcher Affiliation | Academia | Chen-Yu Wei¹, Mehdi Jafarnia-Jahromi¹, Haipeng Luo¹, Hiteshi Sharma¹, Rahul Jain¹ (¹University of Southern California). Correspondence to: Chen-Yu Wei <chenyu.wei@usc.edu>, Mehdi Jafarnia-Jahromi <mjafarni@usc.edu>.
Pseudocode | Yes | Algorithm 1 OPTIMISTIC Q-LEARNING; Algorithm 2 MDP-OOMD; Algorithm 3 ESTIMATEQ; Algorithm 4 OOMDUPDATE. (A hedged sketch of Algorithm 1 appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement or link for the source code of its described methodology. While Appendix D is mentioned for experimental details, it does not state that code is provided there or elsewhere.
Open Datasets | No | The paper mentions conducting experiments, stating 'Details are deferred to Appendix D due to space constraints.' However, the main body of the paper does not specify any datasets used or provide concrete access information for them.
Dataset Splits | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific details on training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The paper mentions conducting experiments in Appendix D, but the main text does not provide specific experimental setup details such as hyperparameter values, training configurations, or system-level settings.
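To make the first algorithm concrete, below is a minimal Python sketch of an optimistic Q-learning loop in the spirit of Algorithm 1 (OPTIMISTIC Q-LEARNING): the average-reward problem is reduced to a discounted one with gamma = 1 - 1/H, Q-values are initialized optimistically, and a count-based bonus drives exploration. The environment interface (reset, step, n_states, n_actions) and the learning-rate and bonus constants are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def optimistic_q_learning(env, T, H, c=1.0):
        # Hedged sketch in the spirit of Algorithm 1 (OPTIMISTIC Q-LEARNING).
        # Assumed (hypothetical) environment interface:
        #   env.reset() -> initial state index
        #   env.step(a) -> (next_state, reward), with rewards in [0, 1]
        #   env.n_states, env.n_actions -> sizes of the state and action spaces
        S, A = env.n_states, env.n_actions
        gamma = 1.0 - 1.0 / H                 # reduction to a discounted problem
        Q = np.full((S, A), float(H))         # optimistic initialization (H = 1 / (1 - gamma))
        V = np.full(S, float(H))
        counts = np.zeros((S, A), dtype=int)  # visit counts n(s, a)
        total_reward = 0.0

        s = env.reset()
        for _ in range(T):
            a = int(np.argmax(Q[s]))          # act greedily w.r.t. the optimistic Q
            s_next, r = env.step(a)
            total_reward += r

            counts[s, a] += 1
            tau = counts[s, a]
            alpha = (H + 1.0) / (H + tau)     # step size decaying with the visit count
            bonus = c * np.sqrt(H / tau)      # count-based exploration bonus; c is a placeholder constant

            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next] + bonus)
            V[s] = min(float(H), Q[s].max())  # keep the value estimate bounded by H
            s = s_next

        return total_reward / T               # empirical average reward over T steps

In the paper the horizon parameter H is tuned against T so that the resulting regret scales as O(T^{2/3}); in this sketch H is simply passed in by the caller.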